Making Databases Work

ACM Books

Editor in Chief

M. Tamer Özsu, University of Waterloo

ACM Books is a new series of high-quality books for the computer science community, published by ACM in collaboration with Morgan & Claypool Publishers. ACM Books publications are widely distributed in both print and digital formats through booksellers and to libraries (and library consortia) and individual ACM members via the ACM Digital Library platform.

Making Databases Work: The Pragmatic Wisdom of Michael Stonebraker

Editor: Michael L. Brodie

2018

The Handbook of Multimodal-Multisensor Interfaces, Volume 2: Signal Processing, Architectures, and Detection of Emotion and Cognition

Editors: Sharon Oviatt, Monash University

Björn Schuller, University of Augsburg and Imperial College London

Philip R. Cohen, Monash University

Daniel Sonntag, German Research Center for Artificial Intelligence (DFKI)

Gerasimos Potamianos, University of Thessaly

Antonio Krüger, Saarland University and German Research Center for Artificial Intelligence (DFKI)

2018

Declarative Logic Programming: Theory, Systems, and Applications

Editors: Michael Kifer, Stony Brook University

Yanhong Annie Liu, Stony Brook University

2018

The Sparse Fourier Transform: Theory and Practice

Haitham Hassanieh, University of Illinois at Urbana-Champaign

2018

The Continuing Arms Race: Code-Reuse Attacks and Defenses

Editors: Per Larsen, Immunant, Inc.

Ahmad-Reza Sadeghi, Technische Universität Darmstadt

2018

Frontiers of Multimedia Research

Editor: Shih-Fu Chang, Columbia University

2018

Shared-Memory Parallelism Can Be Simple, Fast, and Scalable

Julian Shun, University of California, Berkeley

2017

Computational Prediction of Protein Complexes from Protein Interaction Networks

Sriganesh Srihari, The University of Queensland Institute for Molecular Bioscience

Chern Han Yong, Duke-National University of Singapore Medical School

Limsoon Wong, National University of Singapore

2017

The Handbook of Multimodal-Multisensor Interfaces, Volume 1: Foundations, User Modeling, and Common Modality Combinations

Editors: Sharon Oviatt, Incaa Designs

Björn Schuller, University of Passau and Imperial College London

Philip R. Cohen, Voicebox Technologies

Daniel Sonntag, German Research Center for Artificial Intelligence (DFKI)

Gerasimos Potamianos, University of Thessaly

Antonio Krüger, Saarland University and German Research Center for Artificial Intelligence (DFKI)

2017

Communities of Computing: Computer Science and Society in the ACM

Thomas J. Misa, Editor, University of Minnesota

2017

Text Data Management and Analysis: A Practical Introduction to Information Retrieval and Text Mining

ChengXiang Zhai, University of Illinois at Urbana–Champaign

Sean Massung, University of Illinois at Urbana–Champaign

2016

An Architecture for Fast and General Data Processing on Large Clusters

Matei Zaharia, Stanford University

2016

Reactive Internet Programming: State Chart XML in Action

Franck Barbier, University of Pau, France

2016

Verified Functional Programming in Agda

Aaron Stump, The University of Iowa

2016

The VR Book: Human-Centered Design for Virtual Reality

Jason Jerald, NextGen Interactions

2016

Ada’s Legacy: Cultures of Computing from the Victorian to the Digital Age

Robin Hammerman, Stevens Institute of Technology

Andrew L. Russell, Stevens Institute of Technology

2016

Edmund Berkeley and the Social Responsibility of Computer Professionals

Bernadette Longo, New Jersey Institute of Technology

2015

Candidate Multilinear Maps

Sanjam Garg, University of California, Berkeley

2015

Smarter Than Their Machines: Oral Histories of Pioneers in Interactive Computing

John Cullinane, Northeastern University; Mossavar-Rahmani Center for Business and Government, John F. Kennedy School of Government, Harvard University

2015

A Framework for Scientific Discovery through Video Games

Seth Cooper, University of Washington

2014

Trust Extension as a Mechanism for Secure Code Execution on Commodity Computers

Bryan Jeffrey Parno, Microsoft Research

2014

Embracing Interference in Wireless Systems

Shyamnath Gollakota, University of Washington

2014

Making Databases Work

The Pragmatic Wisdom of Michael Stonebraker

Michael L. Brodie

Massachusetts Institute of Technology

ACM Books #22

Copyright © 2019 by the Association for Computing Machinery and Morgan & Claypool Publishers

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means—electronic, mechanical, photocopy, recording, or any other except for brief quotations in printed reviews—without the prior permission of the publisher.

Designations used by companies to distinguish their products are often claimed as trademarks or registered trademarks. In all instances in which Morgan & Claypool is aware of a claim, the product names appear in initial capital or all capital letters. Readers, however, should contact the appropriate companies for more complete information regarding trademarks and registration.

Making Databases Work: The Pragmatic Wisdom of Michael Stonebraker

Michael L. Brodie, editor

books.acm.org

www.morganclaypoolpublishers.com

ISBN: 978-1-94748-719-2 hardcover

ISBN: 978-1-94748-716-1 paperback

ISBN: 978-1-94748-717-8 eBook

ISBN: 978-1-94748-718-5 ePub

Series ISSN: 2374-6769 print 2374-6777 electronic

DOIs:

10.1145/3226595 Book

10.1145/3226595.3226596 Foreword/Preface

10.1145/3226595.3226597 Introduction

10.1145/3226595.3226598 Part I

10.1145/3226595.3226599 Part II/Chapter 1

10.1145/3226595.3226600 Part III/Chapter 2

10.1145/3226595.3226601 Part IV/Chapter 3

10.1145/3226595.3226602 Chapter 4

10.1145/3226595.3226603 Chapter 5

10.1145/3226595.3226604 Chapter 6

10.1145/3226595.3226605 Part V/Chapter 7

10.1145/3226595.3226606 Chapter 8

10.1145/3226595.3226607 Chapter 9

10.1145/3226595.3226608 Part VI/Chapter 10

10.1145/3226595.3226609 Chapter 11

10.1145/3226595.3226610 Chapter 12

10.1145/3226595.3226611 Chapter 13

10.1145/3226595.3226612 Part VII/Chapter 14

10.1145/3226595.3226613 Part VII.A/Chapter 15

10.1145/3226595.3226614 Chapter 16

10.1145/3226595.3226615 Chapter 17

10.1145/3226595.3226616 Chapter 18

10.1145/3226595.3226617 Chapter 19

10.1145/3226595.3226618 Chapter 20

10.1145/3226595.3226619 Chapter 21

10.1145/3226595.3226620 Chapter 22

10.1145/3226595.3226621 Chapter 23

10.1145/3226595.3226622 Part VII.B/Chapter 24

10.1145/3226595.3226623 Chapter 25

10.1145/3226595.3226624 Chapter 26

10.1145/3226595.3226625 Chapter 27

10.1145/3226595.3226626 Chapter 28

10.1145/3226595.3226627 Chapter 29

10.1145/3226595.3226628 Chapter 30

10.1145/3226595.3226629 Chapter 31

10.1145/3226595.3226630 Part VIII/Chapter 32

10.1145/3226595.3226631 Chapter 33

10.1145/3226595.3226632 Chapter 34

10.1145/3226595.3226633 Chapter 35

10.1145/3226595.3226634 Chapter 36

10.1145/3226595.3226635 Part IX/Paper 1

10.1145/3226595.3226636 Paper 2

10.1145/3226595.3226637 Paper 3

10.1145/3226595.3226638 Paper 4

10.1145/3226595.3226639 Paper 5

10.1145/3226595.3226640 Paper 6

10.1145/3226595.3226641 Collected Works

10.1145/3226595.3226642 References/Index/Bios

A publication in the ACM Books series, #22

Editor in Chief: M. Tamer Özsu, University of Waterloo

This book was typeset in Arnhem Pro 10/14 and Flama using ZzTEX.

First Edition

10  9  8  7  6  5  4  3  2  1

This book is dedicated to Michael Stonebraker, Jim Gray, Ted Codd, and Charlie Bachman, recipients of the ACM A.M. Turing Award for the management of data, one of the world’s most valuable resources, and to their many collaborators, particularly the contributors to this volume.

Contents

Data Management Technology Kairometer: The Historical Context

Foreword

Preface

Introduction

Michael L. Brodie

A Brief History of Databases

Preparing to Read the Stories and What You Might Find There

A Travel Guide to Software Systems Lessons in Nine Parts

PART I

2014 ACM A.M. TURING AWARD PAPER AND LECTURE

The Land Sharks Are on the Squawk Box

Michael Stonebraker

Off to a Good Start

First Speedbumps

Another High

The High Does Not Last

The Future Looks Up (Again)

The Good Times Do Not Last Long

The Stories End

Why a Bicycle Story?

The Present Day

References

PART II

MIKE STONEBRAKER’S CAREER

Chapter 1

Make it Happen: The Life of Michael Stonebraker

Samuel Madden

Synopsis

Early Years and Education

Academic Career and the Birth of Ingres

The Post-Ingres Years

Industry, MIT, and the New Millennium

Stonebraker’s Legacy

Companies

Awards and Honors

Service

Advocacy

Personal Life

Acknowledgments

Mike Stonebraker’s Student Genealogy Chart

The Career of Mike Stonebraker: The Chart

PART III

MIKE STONEBRAKER SPEAKS OUT: AN INTERVIEW WITH MARIANNE WINSLETT

Chapter 2

Mike Stonebraker Speaks Out: An Interview

Marianne Winslett

PART IV

THE BIG PICTURE

Chapter 3

Leadership and Advocacy

Philip A. Bernstein

Systems

Mechanisms

Advocacy

Chapter 4

Perspectives: The 2014 ACM Turing Award

James Hamilton

Chapter 5

Birth of an Industry; Path to the Turing Award

Jerry Held

Birth of an Industry (1970s)

Ingres—Timing

Ingres—Team

Ingres—Competition

Ingres—Platform

Adolescence with Competition (1980s and 1990s)

Competing with Oracle

Competing with Oracle (Again)

Maturity with Variety (2000s and 2010s)

Vertica

VoltDB

Tamr

The Bottom Line

Chapter 6

A Perspective of Mike from a 50-Year Vantage Point

David J. DeWitt

Fall 1970—University of Michigan

Fall 1976—Wisconsin

Fall 1983—Berkeley

1988–1995—No Object Oriented DBMS Detour for Mike

2000—Project Sequoia

2003—CIDR Conference Launch

2005—Sabbatical at MIT

2008—We Blog about “MapReduce”

2014—Finally, a Turing Award

2016—I Land at MIT

2017

PART V

STARTUPS

Chapter 7

How to Start a Company in Five (Not So) Easy Steps

Michael Stonebraker

Introduction

Step 1: Have a Good Idea

Step 2: Assemble a Team and Build a Prototype

Step 3: Find a Lighthouse Customer

Step 4: Recruit Adult Supervision

Step 5: Prepare a Pitch Deck and Solicit the VCs

Comments

Summary

Chapter 8

How to Create and Run a Stonebraker Startup—The Real Story

Andy Palmer

An Extraordinary Achievement. An Extraordinary Contribution.

A Problem of Mutual Interest

A Happy Discovery

The Power of Partnership

Fierce Pragmatism, Unwavering Clarity, Boundless Energy

A Final Observation: Startups are Fundamentally about People

Chapter 9

Getting Grownups in the Room: A VC Perspective

Jo Tango

My First Meeting

Context

StreamBase

A Playbook Is Set

Mike’s Values

A Coda

A Great Day

PART VI

DATABASE SYSTEMS RESEARCH

Chapter 10

Where Good Ideas Come From and How to Exploit Them

Michael Stonebraker

Introduction

The Birth of Ingres

Abstract Data Types (ADTs)

Postgres

Distributed Ingres, Ingres*, Cohera, and Morpheus

Parallel Databases

Data Warehouses

H-Store/VoltDB

Data Tamer

How to Exploit Ideas

Closing Observations

Chapter 11

Where We Have Failed

Michael Stonebraker

The Three Failures

Consequences of Our Three Failures

Summary

Chapter 12

Stonebraker and Open Source

Mike Olson

The Origins of the BSD License

BSD and Ingres

The Impact of Ingres

Post-Ingres

The Impact of Open Source on Research

Chapter 13

The Relational Database Management Systems Genealogy

Felix Naumann

PART VII

CONTRIBUTIONS BY SYSTEM

Chapter 14

Research Contributions of Mike Stonebraker: An Overview

Samuel Madden

Technical Rules of Engagement with Mike

Mike’s Technical Contributions

PART VII.A

RESEARCH CONTRIBUTIONS BY SYSTEM

Chapter 15

The Later Ingres Years

Michael J. Carey

How I Ended Up at the Ingres Party

Ingres: Realizing (and Sharing!) a Relational DBMS

Distributed Ingres: One Was Good, So More Must be Better

Ingres: Moving Beyond Business Data

Chapter 16

Looking Back at Postgres

Joseph M. Hellerstein

Context

Postgres: An Overview

Log-centric Storage and Recovery

Software Impact

Lessons

Acknowledgments

Chapter 17

Databases Meet the Stream Processing Era

Magdalena Balazinska, Stan Zdonik

Origins of the Aurora and Borealis Projects

The Aurora and Borealis Stream-Processing Systems

Concurrent Stream-Processing Efforts

Founding StreamBase Systems

Stream Processing Today

Acknowledgments

Chapter 18

C-Store: Through the Eyes of a Ph.D. Student

Daniel J. Abadi

How I Became a Computer Scientist

The Idea, Evolution, and Impact of C-Store

Building C-Store with Mike

Founding Vertica Systems

Chapter 19

In-Memory, Horizontal, and Transactional: The H-Store OLTP DBMS Project

Andy Pavlo

System Architecture Overview

First Prototype (2006)

Second Prototype (2007–2008)

VoltDB (2009–Present)

H-Store/VoltDB Split (2010–2016)

Conclusion

Chapter 20

Scaling Mountains: SciDB and Scientific Data Management

Paul Brown

Selecting Your Mountain

Planning the Climb

Expedition Logistics

Base Camp

Plans, Mountains, and Altitude Sickness

On Peaks

Acknowledgments

Chapter 21

Data Unification at Scale: Data Tamer

Ihab Ilyas

How I Got Involved

Data Tamer: The Idea and Prototype

The Company: Tamr Inc.

Mike’s Influence: Three Lessons Learned.

Chapter 22

The BigDAWG Polystore System

Tim Mattson, Jennie Rogers, Aaron J. Elmore

Big Data ISTC

The Origins of BigDAWG

One Size Does Not Fit All and the Quest for Polystore Systems

Putting it All Together

Query Modeling and Optimization

Data Movement

BigDAWG Releases and Demos

Closing Thoughts

Chapter 23

Data Civilizer: End-to-End Support for Data Discovery, Integration, and Cleaning

Mourad Ouzzani, Nan Tang, Raul Castro Fernandez

We Need to Civilize the Data

The Day-to-Day Life of an Analyst

Designing an End-to-End System

Data Civilizer Challenges

Concluding Remarks

PART VII.B

CONTRIBUTIONS FROM BUILDING SYSTEMS

Chapter 24

The Commercial Ingres Codeline

Paul Butterworth, Fred Carter

Research to Commercial

Conclusions

Open Source Ingres

Chapter 25

The Postgres and Illustra Codelines

Wei Hong

Postgres: The Academic Prototype

Illustra: “Doing It for Dollars”

PostgreSQL and Beyond

Open Source PostgreSQL

Final Thoughts

Chapter 26

The Aurora/Borealis/StreamBase Codelines: A Tale of Three Systems

Nesime Tatbul

Aurora/Borealis: The Dawn of Stream Processing Systems

From 100K+ Lines of University Code to a Commercial Product

Encounters with StreamBase Customers

“Over My Dead Body” Issues in StreamBase

An April Fool’s Day Joke, or the Next Big Idea?

Concluding Remarks

Acknowledgments

Chapter 27

The Vertica Codeline

Shilpa Lawande

Building a Database System from Scratch

Code Meets Customers

Don’t Reinvent the Wheel (Make It Better)

Architectural Decisions: Where Research Meets Real Life

Customers: The Most Important Members of the Dev Team

Conclusion

Acknowledgments

Chapter 28

The VoltDB Codeline

John Hugg

Compaction

Latency

Disk Persistence

Latency Redux

Conclusion

Chapter 29

The SciDB Codeline: Crossing the Chasm

Kriti Sen Sharma, Alex Poliakov, Jason Kinchen

Playing Well with Others

You Can’t Have Everything (at Once)

In Hard Numbers We Trust

Language Matters

Security is an Ongoing Process

Preparing for the (Genomic) Data Deluge

Crossing the Chasm: From Early Adopters to Early Majority

Chapter 30

The Tamr Codeline

Nikolaus Bates-Haus

Neither Fish nor Fowl

Taming the Beast of Algorithmic Complexity

Putting Users Front and Center

Scaling with Respect to Variety

Conclusion

Chapter 31

The BigDAWG Codeline

Vijay Gadepally

Introduction

BigDAWG Origins

First Public BigDAWG Demonstration

Refining BigDAWG

BigDAWG Official Release

BigDAWG Future

PART VIII

PERSPECTIVES

Chapter 32

IBM Relational Database Code Bases

James Hamilton

Why Four Code Bases?

The Portable Code Base Emerges

Looking Forward

Chapter 33

Aurum: A Story about Research Taste

Raul Castro Fernandez

Chapter 34

Nice: Or What It Was Like to Be Mike’s Student

Marti Hearst

Chapter 35

Michael Stonebraker: Competitor, Collaborator, Friend

Don Haderle

Chapter 36

The Changing of the Database Guard

Michael L. Brodie

Dinner with the Database Cognoscenti

The Great Relational-CODASYL Debate

Mike: More Memorable than the Debate, and Even the Cheese

A Decade Later: Friend or Foe?

PART IX

SEMINAL WORKS OF MICHAEL STONEBRAKER AND HIS COLLABORATORS

OLTP Through the Looking Glass, and What We Found There

Stavros Harizopoulos, Daniel J. Abadi, Samuel Madden, Michael Stonebraker

Abstract

1      Introduction

2      Trends in OLTP

3      Shore

4      Performance Study

5      Implications for Future OLTP Engines

6      Related Work

7      Conclusions

8      Acknowledgments

9      Repeatability Assessment

References

“One Size Fits All”: An Idea Whose Time Has Come and Gone

Michael Stonebraker, Uğur Çetintemel

Abstract

1      Introduction

2      Data Warehousing

3      Stream Processing

4      Performance Discussion

5      One Size Fits All?

6      A Comment on Factoring

7      Concluding Remarks

References

The End of an Architectural Era (It’s Time for a Complete Rewrite)

Michael Stonebraker, Samuel Madden, Daniel J. Abadi, Stavros Harizopoulos, Nabil Hachem, Pat Helland

Abstract

1      Introduction

2      OLTP Design Considerations

3      Transaction, Processing and Environment Assumptions

4      H-Store Sketch

5      A Performance Comparison

6      Some Comments about a “One Size Does Not Fit All” World

7      Summary and Future Work

References

C-Store: A Column-Oriented DBMS

Mike Stonebraker, Daniel J. Abadi, Adam Batkin, Xuedong Chen, Mitch Cherniack, Miguel Ferreira, Edmond Lau, Amerson Lin, Sam Madden, Elizabeth O’Neil, Pat O’Neil, Alex Rasin, Nga Tran, Stan Zdonik

Abstract

1      Introduction

2      Data Model

3      RS

4      WS

5      Storage Management

6      Updates and Transactions

7      Tuple Mover

8      C-Store Query Execution

9      Performance Comparison

10    Related Work

11    Conclusions

Acknowledgements and References

The Implementation of POSTGRES

Michael Stonebraker, Lawrence A. Rowe, Michael Hirohama

I    Introduction

II   The POSTGRES Data Model and Query Language

III  The Rules System

IV  Storage System

V    The POSTGRES Implementation

VI  Status and Performance

VII  Conclusions

References

The Design and Implementation of INGRES

Michael Stonebraker, Eugene Wong, Peter Kreps, Gerald Held

1      Introduction

2      The INGRES Process Structure

3      Data Structures and Access Methods

4      The Structure of Process 2

5      Process 3

6      Utilities in Process 4

Conclusion and Future Extensions

Acknowledgment

References

The Collected Works of Michael Stonebraker

References

Index

Biographies

Data Management Technology Kairometer: The Historical Context

The stories in this book recount significant events in the development of data management technology relative to Michael Stonebraker’s achievements, over his career, for which he received the 2014 ACM A.M. Turing Award. To appreciate Mike’s contributions to data management technology, it helps to understand the historical context in which the contributions were made. A Data Management Technology Kairometer1 (available at http://www.morganclaypoolpublishers.com/stonebraker/) answers the questions: What significant data management events were going on at that time in research, in industry, and in Mike’s career?

Over the years covered by this book (1943 to 2018), the Data Management Technology Kairometer lays out, left to right, the significant events in data management research and industry interspersed with the events of Mike’s career. Against these timelines, it presents (top to bottom) the stories ordered by the book’s contents, each with its own timeline. A glance at the kairometer tells you how the timelines of the various stories relate in the context of data management research, industry, and Stonebraker career events. When did that event occur relative to associated events?

Our stories recount three types of event (color-coded): data management research events (blue), such as the emergence of in-memory databases; data management industry events (green), such as the initial release of IBM’s DB2; and milestones in Mike Stonebraker’s career (red), such as his appointment as an assistant professor at UC Berkeley. Stories in black involve multiple event types. Events are separated into the four data management eras described in the book’s introduction (purple): navigational, relational, one-size-does-not-fit-all, and Big Data. The data management kairometer provides an historical context for each story relative to the significant data management research and industry events and Mike’s career.

1. While a chronometer records the passage of all events, a kairometer records the passage of significant events. “Kairos (Kαιρóς) is an Ancient Greek word meaning the right, critical, or opportune moment. The ancient Greeks had two words for time: chronos (χρóvoς) and kairos. The former refers to chronological or sequential time, while the latter signifies a proper or opportune time for action. While chronos is quantitative, kairos has a qualitative, permanent nature.” (source: http://en.wikipedia.org/wiki/Kairos) We welcome ideas to extend the kairometer; contact michaelbrodie@michaelbrodie.com.

前言

Foreword

AM 图灵奖是 ACM 最负盛名的技术奖项,旨在表彰对计算具有持久重要性的重大贡献。图灵奖有时被称为“计算机领域的诺贝尔奖”,其名称是为了纪念英国数学家和计算机科学家艾伦·M·图灵(Alan M. Turing,1912-1954 年)。阿兰·图灵是计算领域的先驱,他在该领域的各个方面取得了根本性的进步,包括计算机体系结构、算法、形式化和人工智能。第二次世界大战期间,他还在英国密码破译工作中发挥了重要作用。

The A.M. Turing Award is ACM’s most prestigious technical award and is given for major contributions of lasting importance to computing. Sometimes referred to as the “Nobel Prize of computing,” the Turing Award was named in honor of Alan M. Turing (1912–1954), a British mathematician and computer scientist. Alan Turing is a pioneer of computing who made fundamental advances in various aspects of the field including computer architecture, algorithms, formalization, and artificial intelligence. He was also instrumental in British code-breaking work during World War II.

The Turing Award was established in 1966, and 67 people have won the award since then. The work of each of these awardees has influenced and changed computing in fundamental ways. Reviewing the award winners’ work gives a historical perspective of the field’s development.

ACM Books has started the Turing Award Series to document the developments surrounding each award. Each book is devoted to one award and may cover one or more awardees. We have two primary objectives. The first is to document how the award-winning works have influenced and changed computing. Each book aims to accomplish this by means of interviews with the awardee(s), their Turing lectures, key publications that led to the award, and technical discussions by colleagues on the work’s impact. The second objective is to celebrate this phenomenal and well-deserved accomplishment. We collaborate with the ACM History Committee in producing these books and they conduct the interviews.

Our hope is that these books will allow new generations to learn about key developments in our field and will provide additional material to historians and students.

M. Tamer Özsu

Editor-in-Chief

Preface

The ACM A.M. Turing Award

This book celebrates Michael Stonebraker’s accomplishments that led to his 2014 ACM A.M. Turing Award “For fundamental contributions to the concepts and practices underlying modern database systems.” [ACM 2016]

When Barbara Liskov, Turing Award committee chair, informed Mike that he had been awarded the 2014 Turing Award, he “… teared up. The recognition and validation for my lifetime work was incredibly gratifying.” [Stonebraker 2015b]

The book describes, for the broad computing community, the unique nature, significance, and impact of Mike’s achievements in advancing modern database systems over more than 40 years. Today, data is considered the world’s most valuable resource,1 whether it is in the tens of millions of databases used to manage the world’s businesses and governments, in the billions of databases in our smart-phones and watches, or residing elsewhere, as yet unmanaged, awaiting the elusive next generation of database systems. Every one of the millions or billions of databases includes features that are celebrated by the 2014 Turing Award and are described in this book.

Why should I care about databases? What is a database? What is data management? What is a database management system (DBMS)? These are just some of the questions that this book answers, in describing the development of data management through the achievements of Mike Stonebraker and his over 200 collaborators. In reading the stories in this book, you will discover core data management concepts that were developed over the two greatest eras—so far—of data management technology. Why do we need database systems at all? What concepts were added? Where did those concepts come from? What were the drivers? How did they evolve? What failed and why? What is the practice of database systems? And, why do those achievements warrant a Turing Award?

While the focus of this book is on Michael Stonebraker, the 2014 Turing Award winner, the achievements that the award honors are not just those of one person, no matter how remarkable s/he may be. The achievements are also due to hundreds of collaborators—researchers, students, engineers, coders, company founders and backers, partners, and, yes, even marketing and sales people. Did all of the ideas come from Mike? Read on, especially Mike’s chapter “Where Good Ideas Come from and How to Exploit Them” (Chapter 10).

I have had the great privilege of working with more than my fair share of Turing Award recipients starting as an undergraduate taking complexity theory from Steve Cook of P = NP fame. No two Turing Award winners are alike in topic, approach, methods, or personality. All are remarkably idiosyncratic. Mike is, to say the least, idiosyncratic, as you will discover in these pages.

This book answers questions, like those in italics, in 30 stories, each by storytellers who were at the center of the story. The stories involve technical concepts, projects, people, prototype systems, failures, lucky accidents, crazy risks, startups, products, venture capital, and lots of applications that drove Mike Stonebraker’s achievements and career. Even if you have no interest in databases at all,2 you’ll gain insights into the birth and evolution of Turing Award-worthy achievements from the perspectives of 39 remarkable computer scientists and professionals.

Making Databases Work: The Pragmatic Wisdom of Michael Stonebraker

The theme of this book is modern database systems. The 2014 A.M. Turing Award was conferred “For fundamental contributions to the concepts and practices underlying modern database systems.” It is 1 of only 4 Turing Awards given for databases, and 1 of only 2 out of 51 given for computer systems.

Mike addressed the systems theme in his Turing Award lecture (typically intended to summarize Turing-worthy achievements) in terms of the challenges that he faced and the approach he took to systems research, in four steps. “The first was to try to explain why system software is so hard to build, and why good teams screw it up on a regular basis. Second, it takes real perseverance to “stick it out” and make something actually work. The third was to talk about the start-up experience, and why venture capitalists usually deserve their reputation as “land sharks.” Lastly, it is clear that luck plays a significant role in successful startups, and I wanted to explain that. The overarching theme was to use a significant physical challenge as a metaphor for system software development. Over the years, the physical challenge has varied between our cross-country bike ride in 1988, and my climbing all forty-eight 4,000-foot mountains in New Hampshire.” [Stonebraker 2015b]

This description contains the seeds of answers to the previous italicized questions that are elaborated throughout the book.

The computer systems theme is pursued in the book by stories told from the research perspective: What were the core database concepts? How did they develop? Why were they significant? And stories told from the computer systems perspective: What are the development or engineering challenges? What challenges arise in implementing a research idea? How are they overcome? Do essential research contributions arise from systems engineering? As you read these stories ask yourself: What is the relationship between research and systems engineering? Why build prototype systems at all? Having proven concepts in research and in a prototype system, why build a product? (Spoiler alert: While money plays a significant role, it was by no means the goal.)

Acknowledging 39 Remarkable Contributors

This book is a collection of 36 stories written by Mike3 and 38 of his collaborators: 23 world-leading database researchers, 11 world-class systems engineers, and 4 business partners. They were aided by an editor and 4 professional publishers and editors.

It is my great pleasure to acknowledge the fascinating contributions of all of these remarkable people. They responded with enthusiasm to recount their collaborations with Mike, looking for the essential contributions and how they emerged, all mixed with concern for accurately remembering the crucial facts for you, the reader—in some cases reaching back four decades. What was important? What seemed to matter vs. what really mattered? Each contributor, like Mike, is idiosyncratic and strongly opinionated, as you will see. Their achievements reflect the state of the technology and data management demands of the time. Everyone had to be reminded to reflect disagreements with Mike (showing the normal give-and-take of computing research and product development), as well as to state why Mike’s contributions warranted the Turing Award. Interestingly, few authors felt comfortable praising Mike, perhaps reflecting the personalities of computer scientists.

Mike can be intimidating. He has made a career of making bold, black-and-white statements to challenge and to inspire himself and the database community to greater accomplishments, as Phil Bernstein recounts so well.4 It’s a sign of maturity for a Ph.D., postdoc, or collaborator of any kind to stand up and refute Mike—and such a pleasure to experience, Mike included. You will see, in each story, the efforts of the contributors to pass this benchmark.

A theme of Mike’s career has been to question conventional wisdom, specifically as it ages and as new challenges and concepts arise or as it is undermined by poor practices. The most obvious example is Mike’s claim that “one-size-does-not-fit-all” in databases, which is a complete contradiction of the claims of the Elephants, Mike’s affectionate term for the DBMSs that dominate their market. Yet, Mike was the chief proponent of “one-size-fits-all” in the relational database era. It has been fascinating to watch Mike’s contributions become conventional wisdom, which Mike then questions toward the next level of achievement.

If you are an aspiring researcher, engineer, or entrepreneur, you might read these stories to find these turning points as practice to tilt at your own computer-science windmills, to spur yourself to your next step of innovation and achievement.

Janice Brown, Our Amanuensis

My greatest acknowledgement, for her contributions to this book, is for Janice L. Brown, technology writer/editor, startup consultant, and frequent Stonebraker collaborator, of Janice Brown & Associates, Inc. In many ways this is Janice’s book. (If you dream about the book, it’s yours.) Janice was our amanuensis: editor, copywriter, enthusiast, critic,5 and berger du chats (cat herder) extraordinaire et malheureusement, très necessaire.

Michael Brodie

October 2018

1. The Economist, May 6, 2017.

2. Is that possible?

3. Mike wrote some of the most interesting chapters; he did not review other chapters so as not to influence the authors’ voices.

4. See Chapter 3, “Leadership and Advocacy.”

5. “Truly a great story, yet perhaps you didn’t really mean that; how about …”

Introduction

Michael L. Brodie

Our story begins at University of California, Berkeley in 1971, in the heat of the Vietnam War. A tall, ambitious, newly minted assistant professor with an EE Ph.D. in Markov chains asks a seasoned colleague for direction in shaping his nascent career for more impact than he could imagine possible with Markov chains.1 Professor Eugene Wong suggests that Michael Stonebraker read Ted Codd’s just-published paper on a striking new idea: relational databases. Upon reading the paper, Mike is immediately convinced of the potential, even though he knows essentially nothing about databases. He sets his sights on that pristine relational mountaintop. The rest is history. Or rather, it is the topic of this book. Ted’s simple, elegant, relational data model and Mike’s contributions to making databases work in practice helped forge what is today a $55 billion industry. But, wait, I’m getting ahead of myself.

A Brief History of Databases

What’s a database? What is data management? How did they evolve?

Imagine accelerating by orders of magnitude the discovery of cancer causes and the most effective treatments to radically reduce the 10M annual deaths worldwide. Or enabling autonomous vehicles to profoundly reduce the 1M annual traffic deaths worldwide while reducing pollution, traffic congestion, and real estate wasted on vehicles that on average are parked 97% of the time. These are merely two examples of the future potential positive impacts of using Big Data. As with all technical advances, there is also potential for negative impacts, both unintentional—in error—and intentional, such as undermining modern democracies (allegedly well under way). Using data means managing data efficiently at scale; that’s what data management is for.

In its May 6, 2017 issue, The Economist declared data to be the world’s most valuable resource. In 2012, data science—often AI-driven analysis of data at scale—exploded on the world stage. This new, data-driven discovery paradigm may be one of the most significant advances of the early 21st century. Non-practitioners are always surprised to find that 80% of the resources required for a data science project are devoted to data management. The surprise stems from the fact that data management is an infrastructure technology: basically, unseen plumbing. In business, data management “just gets done” by mature, robust database management systems (DBMSs). But data science poses new, significant data management challenges that have yet to be understood, let alone addressed.

The above potential benefits, risks, and challenges herald a new era in the development of data management technology, the topic of this book. How did it develop in the first place? What were the previous eras? What challenges lie ahead? This introduction briefly sketches answers to those questions.

This book is about the development and ascendancy of data management technology enabled by the contributions of Michael Stonebraker and his collaborators. Novel when created in the 1960s, data management became the key enabling technology for businesses of all sizes worldwide, leading to today’s $55B2,3 DBMS market and tens of millions of operational databases. The average Fortune 100 company has more than 5,000 operational databases, supported by tens of DBMS products.

Databases support your daily activities, such as securing your banking and credit card transactions. So that you can buy that Starbucks latte, Visa, the leading global payments processor, must be able to simultaneously process 50,000 credit card transactions per second. These “database” transactions update not just your account and those of 49,999 other Visa cardholders, but also those of 50,000 creditors like Starbucks while simultaneously validating you, Starbucks, and 99,998 others for fraud, no matter where your card was swiped on the planet. Another slice of your financial world may involve one of the more than 3.8B trade transactions that occur daily on major U.S. market exchanges. DBMSs support such critical functions not just for financial systems, but also for systems that manage inventory, air traffic control, supply chains, and all daily functions that depend on data and data transactions. Databases managed by DBMSs are even on your wrist and in your pocket if you, like 2.3B others on the planet, use an iPhone or Android smartphone. If databases stopped, much of our world would stop with them.
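
The all-or-nothing property that such payment transactions rely on can be sketched with Python’s built-in sqlite3 module. This is a toy illustration, not Visa’s actual system: the ledger table, account names, and the overdraft check are all invented for the example. Either both account updates happen, or—if the check fails—neither does.

```python
import sqlite3

# In-memory database standing in for a payment processor's ledger.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (owner TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('cardholder', 100), ('starbucks', 0)")
conn.commit()

def transfer(conn, src, dst, amount):
    """Move money atomically: both updates commit together, or neither does."""
    try:
        with conn:  # opens a transaction; commits on success, rolls back on error
            conn.execute("UPDATE accounts SET balance = balance - ? WHERE owner = ?",
                         (amount, src))
            conn.execute("UPDATE accounts SET balance = balance + ? WHERE owner = ?",
                         (amount, dst))
            # A stand-in validation step: reject overdrafts.
            (bal,) = conn.execute("SELECT balance FROM accounts WHERE owner = ?",
                                  (src,)).fetchone()
            if bal < 0:
                raise ValueError("insufficient funds")
    except ValueError:
        pass  # transaction was rolled back; balances are unchanged

transfer(conn, "cardholder", "starbucks", 5)    # succeeds
transfer(conn, "cardholder", "starbucks", 500)  # rejected and rolled back

balances = dict(conn.execute("SELECT owner, balance FROM accounts ORDER BY owner"))
print(balances)  # {'cardholder': 95, 'starbucks': 5}
```

The second transfer leaves no trace: the rollback undoes both partial updates, which is exactly the guarantee that lets thousands of such transactions run concurrently without corrupting accounts.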

A database is a logical collection of data, like your credit card account information, often stored in records that are organized under some data model, such as tables in a relational database. A DBMS is a software system that manages databases and ensures persistence—so data is not lost—with languages to insert, update, and query data, often at very high data volumes and low latencies, as illustrated above.

Data management technologies have evolved through four eras over six decades, as is illustrated in the Data Management Technology Kairometer (see page xxvii).

In the inaugural navigational era (1960s), the first DBMSs emerged. In them, data, such as your mortgage information, was structured in hierarchies or networks and accessed using record-at-a-time navigation query languages. The navigational era gained a database Turing Award in 1973 for Charlie Bachman’s “outstanding contributions to database technology.”

In the second, relational era (1970s–1990s), data was stored in tables accessed using a declarative, set-at-a-time query language, SQL: for example, “Select name, grade From students in Engineering with a B average.” The relational era ended with approximately 30 commercial DBMSs dominated by Oracle’s Oracle, IBM’s DB2, and Microsoft’s SQL Server. The relational era gained two database Turing Awards: one in 1981 for Codd’s “fundamental and continuing contributions to the theory and practice of database management systems, esp. relational databases” and one in 1998 for Jim Gray’s “seminal contributions to database and transaction processing research and technical leadership in system implementation.”
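
The quoted query can be made concrete. A minimal sketch, again using Python’s sqlite3 (the table layout, sample rows, and the reading of “B average” as a GPA of 3.0 are invented assumptions): the query states *what* rows are wanted, and the DBMS decides *how* to find them—the declarative, set-at-a-time idea that distinguished the relational era from record-at-a-time navigation.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (name TEXT, dept TEXT, grade TEXT, gpa REAL)")
conn.executemany("INSERT INTO students VALUES (?, ?, ?, ?)", [
    ("Ada",   "Engineering", "B", 3.0),
    ("Grace", "Engineering", "A", 4.0),
    ("Alan",  "History",     "B", 3.0),
])

# Declarative, set-at-a-time: no loops, no access paths, just the condition.
rows = conn.execute(
    "SELECT name, grade FROM students WHERE dept = 'Engineering' AND gpa = 3.0"
).fetchall()
print(rows)  # [('Ada', 'B')]
```

A navigational-era program would instead have walked record pointers one at a time; here the optimizer is free to pick any access path that satisfies the predicate.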

The start of Mike Stonebraker’s database career coincided with the launch of the relational era. Serendipitously, Mike was directed by his colleague, UC Berkeley professor Eugene Wong, to the topic of data management and to the relational model via Ted’s paper. Attracted by the model’s simplicity compared to that of navigational DBMSs, Mike set his sights, and ultimately built his career, on making relational databases work. His initial contributions, Ingres and Postgres, did more to make relational databases work in practice than those of any other individual. After more than 30 years, Postgres—via PostgreSQL and other derivatives—continues to have a significant impact. PostgreSQL is the third4 or fourth5 most popular (used) of hundreds of DBMSs, and all relational DBMSs, or RDBMSs for short, implement the object-relational data model and features introduced in Postgres.

In the first relational decade, researchers developed core RDBMS capabilities including query optimization, transactions, and distribution. In the second relational decade, researchers focused on high-performance queries for data warehouses (using column stores) and high-performance transactions and real-time analysis (using in-memory databases); extended RDBMSs to handle additional data types and processing (using abstract data types); and tested the “one-size-fits-all” principle by fitting myriad application types into RDBMSs. But complex data structures and operations—such as those in geographic information, graphs, and scientific processing over sparse matrices—just didn’t fit. Mike knew because he pragmatically exhausted that premise using real, complex database applications to push RDBMS limits, eventually concluding that “one-size-does-not-fit-all” and moving to special-purpose databases.
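
The column-store idea mentioned above can be sketched in a few lines of plain Python. This is a toy contrast of the two layouts, not how an actual column store such as Vertica organizes disk pages; the sample records are invented. The point is that an analytic query over one attribute touches only that attribute in a columnar layout, while a row layout drags every field of every record along.

```python
# Row store: each record kept together -- suits transactions on whole records.
row_store = [
    {"id": 1, "product": "latte", "price": 5},
    {"id": 2, "product": "mocha", "price": 6},
    {"id": 3, "product": "scone", "price": 3},
]

# Column store: each attribute kept together -- suits analytic scans.
col_store = {
    "id":      [1, 2, 3],
    "product": ["latte", "mocha", "scone"],
    "price":   [5, 6, 3],
}

# Analytic query: average price. The row store must visit every record;
# the column store reads just the one column the query needs.
avg_row = sum(r["price"] for r in row_store) / len(row_store)
avg_col = sum(col_store["price"]) / len(col_store["price"])
assert avg_row == avg_col  # same answer, very different data touched
```

At warehouse scale the difference is not cosmetic: scanning one column instead of wide rows can cut I/O by an order of magnitude, which is why column stores won for data warehousing.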

With the rest of the database research community, Mike turned his attention to special-purpose databases, launching the third, “one-size-does-not-fit-all” era (2000–2010). Researchers in the “one-size-does-not-fit-all” era developed data management solutions for “one-size-does-not-fit-all” data and associated processing (e.g., time series, semi- and un-structured data, key-value data, graphs, documents) in which data was stored in special-purpose forms and accessed using non-SQL query languages, called NoSQL, of which Hadoop is the best-known example. Mike pursued non-relational challenges in the data manager but, claiming inefficiency of NoSQL, continued to leverage SQL’s declarative power to access data using SQL-like languages, including an extension called NewSQL.

New DBMSs proliferated, due to the research push for and application pull of specialized DBMSs, and the growth of open-source software. The “one-size-does-not-fit-all” era ended with more than 350 DBMSs, split evenly between commercial and open source and supporting specialized data and related processing including (in order of utilization): relational, key-values, documents, graphs, time series, RDF, objects (object-oriented), search, wide columns, multi-value/dimensional, native XML, content (e.g., digital, text, image), events, and navigational. Despite the choice and diversity of DBMSs, the market remained dominated by five relational DBMSs: the original three plus Microsoft Access and Teradata, which Mike came to call collectively “the Elephants.” Although the Elephants all supported Mike’s object-relational model, they had become “conventional wisdom” that lagged new data management capabilities. A hallmark of Mike’s career is to perpetually question conventional wisdom, even of his own making. At the end of the “one-size-does-not-fit-all” era, there was a significant shift in the DBMS market away from the RDBMS Elephants and to less-expensive open-source DBMSs.

The relational and “one-size-does-not-fit-all” eras gained a database Turing Award for Stonebraker’s “concepts and practices underlying modern database systems.” It is the stories of these relational and “one-size-does-not-fit-all” developments that fill these pages. You will find stories of their inception, evolution, experimentation, demonstration, and realization by Mike and his collaborators through the projects, products, and companies that brought them to life.

This brings us to the fourth and current Big Data era (2010–present), characterized by data at volumes, velocities, and variety (heterogeneity) that cannot be handled adequately by existing data management technology. Notice that, oddly, “Big Data” is defined in terms of data management technology rather than in terms of the real-world phenomena that the data typically represents. Following the previous eras, one might imagine that the Big Data era was launched by the database research community’s pursuit of addressing future data management requirements. That is not what happened. The database community’s focus remained on relational and “one-size-does-not-fit-all” for three decades with little concern for a grander data management challenge—namely, managing all data, as the name “data management” suggests. The 2012 annual IDC/EMC Digital Universe study [Gantz and Reinsel 2013] estimated that of all data in the expanding digital universe, less than 15% was amenable to existing DBMSs. Large enterprises like Yahoo! and Google faced massive data management challenges for which there were no data management solutions in products or research prototypes. Consequently, the problem owners built their own solutions—thus the genesis of Hadoop, MapReduce, and myriad NoSQL Big Data managers. In 2009, Mike famously criticized MapReduce to the chagrin of the Big Data community, only to be vindicated five years later when its creators disclosed that Mike’s criticism coincided with their abandonment of MapReduce and Hadoop for yet another round of Big Data management solutions of their own making. This demonstrates the challenges of the Big Data era and that data management at scale is hard to address without the data management underpinnings established over six decades. Retrospectively, it illustrates the challenges faced in the previous database eras and validates that the solutions warranted database Turing Awards.
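
For readers who have not met it, the programming model at issue in that criticism can be sketched in a few lines of Python. This is a single-process toy word count—real MapReduce shards the map, shuffle, and reduce phases across thousands of machines, which is what Mike’s critique targeted as reinventing decades-old database machinery.

```python
from itertools import groupby

def map_phase(doc):
    # Map: emit one (key, value) pair -- ('word', 1) -- per occurrence.
    return [(word, 1) for word in doc.split()]

def reduce_phase(pairs):
    # Shuffle/sort stand-in: bring equal keys together, then fold each group.
    pairs.sort(key=lambda kv: kv[0])
    return {key: sum(v for _, v in group)
            for key, group in groupby(pairs, key=lambda kv: kv[0])}

docs = ["big data big ideas", "data management"]
pairs = [kv for doc in docs for kv in map_phase(doc)]
counts = reduce_phase(pairs)
print(counts)  # {'big': 2, 'data': 2, 'ideas': 1, 'management': 1}
```

Part of the criticism was precisely that this pattern—group by key, then aggregate—is what a declarative SQL `GROUP BY` already expresses, with an optimizer free to choose the execution strategy.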

We appear to be entering a golden age of data, largely due to our expectations for Big Data: that data will fuel and accelerate advances in every field for which adequate data is available. Although this era started in 2010, there has been little progress in corresponding data management solutions. Conventional DBMS “wisdom,” as Stonebraker loves to say, and architectures do not seem to apply. Progress, at least progress Stonebraker-style (as we will see throughout the book), is hindered by a paucity of effective use cases. There are almost no (1) reasonably well-understood Big Data management applications that are (2) owned by someone with a pain that no one else has addressed, with (3) the willingness to provide access to their data and to be open to exploring new methods to resolve the pain.

The bright side of this is the explosion of data-intensive applications leading to, as Michael Cafarella and Chris Ré of Stanford University say in Cafarella and Ré [2018], a “blindingly bright future” for database research. The dominant class of data-intensive processing is data science, itself just emerging as a domain; thus, its use above as an example of the need for effective new database management technologies. As the data management technology evolution story continues, the overwhelming interest in data requires a distinction between core data management technology, the topic of this book, and its development to support activities in every human endeavor.

If you are a young researcher or engineer contemplating your future in data management or in computer systems, you now find yourself at the dawn of the Big Data era, much as Mike found himself at the beginning of the relational era. Just as Mike was new to the then-new idea of relational databases, so too you must be new to the as-yet-undefined notion of a Big Data system and data manager. These pages tell the stories of Mike’s career as seen from the many different perspectives of his primary collaborators. These stories may give you a historical perspective and may provide you guidance on your path: what issues to pursue, how to select them, how to pursue them, how to collaborate with people who complement your knowledge, and more. These stories recount not only challenges, technology, methods, and collaboration styles, but also people’s attitudes that, perhaps more so than technology, contributed to the state of data management today, and specifically to the achievements of Mike Stonebraker and his collaborators. However, the world now is not as it was in 1971 as Mike launched his career, as Mike discusses in Chapter 11, “Where We Have Failed.”

Preparing to Read the Stories and What You Might Find There

To explore and understand Lithuania on a trip, few people would consider visiting every city, town, and village. So how do you choose what to see? Lonely Planet’s or Eyewitness’ travel guides to Estonia, Latvia, and Lithuania are fine, but they are 500 pages each. It’s better to be motivated and have worked out the initial questions that you would like answered. Here is a method for visiting Lithuania and for reading the stories in this book.

The book was written by computer systems researchers, engineers, developers, startup managers, funders, Silicon Valley entrepreneurs, executives, and investors for people like themselves and for the broad computing community. Let’s say you aspire to a career in, or are currently in, one of those roles: What would you like to learn from that perspective?

Say, for example, that you are a new professor in Udine, Italy, planning your career in software systems research. As Mike Stonebraker was at UC Berkeley in 1971, you ask: What do I want to do in my career? How do I go about realizing that career? Develop a list of questions you would like to pursue such as the 20 or so italicized questions in this Introduction and the Preface. Select the stories that appear to offer answers to your questions in a context in which you have some passion. Each story results in significant successes, perhaps on different topics that interest you, or in a different culture and a different time. Nonetheless, the methods, the attitudes, and the lessons are generally independent of the specific story. The contributors have attempted to generalize the stories with the hindsight and experience of as much as 40 years.

Choose your role, figure out the questions on which you would like guidance, choose your own perspective (there are 30 in the book), and set off on your journey, bailing when it is not helpful. Become your own software systems ciceroni.6

A Travel Guide to Software Systems Lessons in Nine Parts

The 30 stories in this book are arranged into 9 parts.

Part I, “2014 ACM A.M. Turing Award Paper and Lecture,” contains the paper in which Turing awardees typically describe the achievements for which the award was conferred. The paper is given as a lecture at ACM conferences during the award year. True to Mike’s idiosyncratic nature, he used the paper and lecture as a platform for what he considered his most important message for the community, as opposed to the already published technical achievements. Mike writes of challenges posed by software systems, the theme of this book, as explained in the Preface. He describes the nature of the challenges by analogy with significant physical challenges from his own personal life, something most audiences can understand. The paper is reproduced from the Communications of the ACM. The lecture was given initially at the Federated Computing Research Conference, June 13, 2015, and can be viewed online.7

Part II, “Mike Stonebraker’s Career,” lays out Mike’s career in Sam Madden’s biography and in two graphic depictions. Chart 1 lists chronologically Mike’s Ph.D. students and postdocs. Chart 2 illustrates the academic projects and awards and the creation and acquisition of his companies. On April 12, 2014, Mike Carey (University of California, Irvine), David DeWitt (then at University of Wisconsin), Joe Hellerstein (University of California, Berkeley), Sam Madden (MIT), Andy Pavlo (Carnegie Mellon University), and Margot Seltzer (Harvard University) organized a Festschrift for Mike: a day-long celebration of Mike Stonebraker at 70.8 More than 200 current and former colleagues, investors, collaborators, rivals, and students attended. It featured speakers and discussion panels on the major projects from Mike’s 40-plus year career. Chart 2, “The Career of Mike Stonebraker,” was produced for the Festschrift by Andy Pavlo and his wife.

Part III, “Mike Stonebraker Speaks Out: An Interview with Marianne Winslett,” is a post-Turing-Award interview in the storied series of interviews of database contributors by Marianne Winslett. A video of the interview can be seen online at https://www.youtube.com/watch?v=vQIkkDaw6iE.

In Part IV, “The Big Picture,” world-leading researchers, engineers, and entrepreneurs reflect on Mike’s contributions in the grander scope of things. Phil Bernstein, a leading researcher and big thinker, reflects on Mike’s leadership and advocacy. James Hamilton, a world-class engineer and former lead architect of DB2—an Elephant—reflects on the value of Turing contributions. Jerry Held, Mike’s first Ph.D. student, now a leading Silicon Valley entrepreneur, recounts experiences collaborating and competing with Mike. Dave DeWitt, comparable to Mike in his data management contributions, reflects on 50 years as a mentee, colleague, and competitor.

Part V, “Startups,” tells one of the 21st century’s hottest stories: how to create, fund, and run a successful technology startup. As the Data Management Technology Kairometer (see page xxvii) illustrates, Mike has co-founded nine startups. The startup story is told from three distinctly different points of view: that of the technical innovator and Chief Technology Officer (Mike), that of the CEO, and that of the prime funder. These are not mere get-rich-quick or get-famous-quick stories. Startups and their products are an integral component of Mike Stonebraker’s database technology research and development methodology, ensuring that the results have impact. This theme of industry-driven and industry-proven database technology research pervades these pages. If you get only one thing from this book, let this be it.

Part VI, “Database Systems Research,” takes us into the heart of database systems research. Mike answers: Where do ideas come from? How to exploit them? Take a master class with a Turing Award winner on how to do database systems research. Mike’s invitation to write the “Failures” chapter was: So, Mike, what do you really think? His surprising answers lay out challenges facing database systems research and the database research community. Mike Olson, a Ph.D. student of Mike and an extremely successful co-founder and Chief Strategy Officer of Cloudera (which has data management products based on Hadoop), lays out Mike’s contributions relative to the open-source movement that was launched after Ingres was already being freely distributed. Part VI concludes with Felix Naumann’s amazing Relational Database Management Genealogy, which graphically depicts the genealogy of hundreds of RDBMSs from 1970 to the present showing how closely connected RDBMSs are in code, concepts, and/or developers.

Part VII, “Contributions by System,” describes the technical contributions for which the Turing award was given. They are presented chronologically in the context of the nine projects, each centered on the software system in which the contributions arose. Each system is described from a research perspective in Part VII.A, “Research Contributions by System,” and from a systems perspective, in a companion story in Part VII.B, “Contributions from Building Systems.” Chapter 14 synopsizes the major technical achievements and offers a map to contributions and their stories.

The stories in “Research Contributions by System” focus on the major technical achievements that arose in each specific project and system. They do not repeat the technical arguments from the already published papers that are all cited in the book. The research chapters explain the technical accomplishments: their significance, especially in the context of the technology and application demands of the time; and their value and impact in the resulting technology, systems, and products that were used to prove the ideas and in those that adopted the concepts. Most stories, like Daniel Abadi’s Chapter 18, tell of career decisions made in the grips of challenging and rewarding research. The first seven systems—Ingres, Postgres, Aurora, C-Store, H-Store, SciDB, and Data Tamer—span 1972–2018, including the relational, “one-size-does-not-fit-all,” and now the Big Data eras of data management. All seven projects resulted in successful systems, products, and companies. Not included in the book are two systems that Mike does not consider successful. (But if some didn’t fail, he wasn’t trying hard enough.) The final two projects, BigDAWG and Data Civilizer, are under way at this writing as two of Mike’s visions in the Big Data world.

The stories in “Contributions from Building Systems” are a little unusual in that they tell seldom-told stories of heroism in software systems engineering. The development of a software system, e.g., a DBMS, is a wonderful and scary experience for individuals, the team, and the backers! There is a lot of often-unsung drama: inevitable disasters and unbelievable successes, and always discoveries, sometimes new but often repeated for the umpteenth time. Key team members of some of the best-known DBMSs in the world (DB2, Ingres, Postgres) tell tales of the development of the codelines we have come to know as DB2, Ingres, StreamBase, Vertica, VoltDB, and SciDB, and the data unification system Tamr.

These stories were motivated, in part, by an incidental question that I asked James Hamilton, former lead architect, IBM DB2 UDB, which (with Stonebraker’s Ingres) was one of the first relational DBMSs and became an Elephant. I asked James: “Jim Gray told me some fascinating stories about DB2. What really happened?” What unfolded was a remarkable story told in Chapter 32, “IBM Relational Database Code Bases,” which made it clear that all the codeline stories must be told. One sentence that got me was: “Instead of being a punishing or an unrewarding ‘long march,’ the performance improvement project was one of the best experiences of my career.” This came from one of the world’s best systems architects, currently, Vice President and Distinguished Engineer, Amazon Web Services.

But, more importantly, these stories demonstrate the theme of the book and Mike’s observation that “system software is so hard to build, and why good teams screw it up on a regular basis.”

From the beginning of Mike’s database career, the design, development, testing, and adoption of prototype and commercial systems have been fundamental to his research methodology and to his technical contributions. As he says in Chapter 9, “Ingres made an impact mostly because we persevered and got a real system to work.” Referring to future projects, he says, “In every case, we built a prototype to demonstrate the idea. In the early days (Ingres/Postgres), these were full-function systems; in later days (C-Store/H-Store) the prototypes cut a lot of corners.” These systems were used to test and prove or disprove research hypotheses, to understand engineering aspects, to explore the details of real use cases, and to explore the adoption, and hence the impact, of the solutions. As a result, research proceeded in a virtuous cycle in which research ideas improved systems and, in turn, systems and application challenges posed research challenges.

Part VIII, “Perspectives,” offers five personal stories. James Hamilton recounts developing one of the world’s leading DBMSs, which included the highlights of his storied engineering career. Raul Castro Fernandez recounts how, as a Stonebraker postdoc, he learned how to do computer systems research—how he gained research taste. Marti Hearst tells an engaging story of how she matured from a student to a researcher under a seemingly intimidating but actually caring mentor. Don Haderle, a product developer on IBM’s Systems R, the alleged sworn enemy of the Ingres project, speaks admiringly of the competitor who became a collaborator and friend. In the final story of the book, I recount meeting Mike Stonebraker for the first time in 1974. I had not realized until I reflected for this story that the 1974 pre-SIGMOD (Special Interest Group on Management of Data) conference that hosted the much anticipated CODASYL-Relational debate marked a changing of the database guard: not just shifting database research leadership from the creators of navigational DBMSs to new leaders, such as Mike Stonebraker, but also foreshadowing the decline of navigational database technology and the rise of the relational and subsequent eras of database technology.

Part IX, “Seminal Works of Michael Stonebraker and His Collaborators,” reprints the six papers that, together with the 2014 ACM A.M. Turing Award Paper in Chapter 1, constitute the papers that present Mike’s most significant technical achievements. Like most of the stories in this book, these seminal works should be read in the context of the technology and challenges at the time of their publication. That context is exactly what the corresponding research and systems stories in Part VII provide. Until now, those papers lacked that context, now given by contributors who were central to those contributions and told from the perspective of 2018.

Mike’s 7 seminal papers were chosen from the approximately 340 that he has authored or co-authored in his career. Mike’s publications are listed in “Collected Works of Michael Stonebraker,” page 607.

1. “I had to publish, and my thesis topic was going nowhere.”—Mike Stonebraker

2. http://www.statista.com/statistics/724611/worldwide-database-market/

3. Numbers quoted in this chapter are as of mid-2018.

4. http://www.eversql.com/most-popular-databases-in-2018-according-to-stackoverflow-survey/

5. http://db-engines.com/en/ranking

6. An expert tour guide who makes the trip worthwhile. The term derives from Marcus Tullius Cicero, the Roman politician and orator known for guiding audiences through complex political theses; today it is more often applied to guides to restaurants.

7. http://www.youtube.com/watch?v=BbGeKi6T6QI

8. For photographs see http://stonebraker70.com.

PART I

2014 ACM A.M. TURING AWARD PAPER AND LECTURE

The Land Sharks Are on the Squawk Box

Michael Stonebraker

It turns out riding across America is more than a handy metaphor for building system software.

—Michael Stonebraker

KENNEBAGO, ME, SUMMER 1993. The “Land Sharks” are on the squawk box, Illustra (the company commercializing Postgres) is down to fumes, and I am on a conference call with the investors to try to get more money. The only problem is I am in Maine at my brother’s fishing cabin for a family event while the investors are on a speakerphone (the squawk box) in California. There are eight of us in cramped quarters, and I am camped out in the bathroom trying to negotiate a deal. The conversation is depressingly familiar. They say more-money-lower-price; I say less-money-higher-price. We ultimately reach a verbal handshake, and Illustra will live to fight another day.

Negotiating with the sharks is always depressing. They are superb at driving a hard bargain; after all, that is what they do all day. I feel like a babe in the woods by comparison.

This article interleaves two stories (see Figure 1). The first is a cross-country bike ride my wife Beth and I took during the summer of 1988; the second is the design, construction, and commercialization of Postgres, which occurred over a 12-year period, from the mid-1980s to the mid-1990s. After telling both stories, I will draw a series of observations and conclusions.

Key Insights

•  Explained is the motivation behind Postgres design decisions, as are “speedbumps” encountered.

•  Riding a bicycle across America and building a computer software system are both long and difficult affairs, constantly testing personal fortitude along the way.

•  Serendipity played a major role in both endeavors.

Off to a Good Start

Anacortes, WA, June 3, 1988. Our car is packed to the gills, and the four of us (Beth; our 18-month-old daughter Leslie; Mary Anne, our driver and babysitter; and me) are squished in. It has been a stressful day. On the roof is the cause of it all—our brand-new tandem bicycle. We spent the afternoon in Seattle bike shops getting it repaired. On the way up from the Bay Area, Mary Anne drove into a parking structure lower than the height of the car plus the bike. Thankfully, the damage is repaired, and we are all set to go, if a bit frazzled. Tomorrow morning, Beth and I will start riding east up the North Cascades Scenic Highway; our destination, some 3,500 miles away, is Boston, MA. We have therefore christened our bike “Boston Bound.”

It does not faze us that we have been on a tandem bike exactly once, nor that we have never been on a bike trip longer than five days. The fact we have never climbed mountains like the ones directly in front of us is equally undaunting. Beth and I are in high spirits; we are starting a great adventure.

Berkeley, CA, 1984. We have been working on Ingres for a decade. First, we built an academic prototype, then made it fully functional, and then started a commercial company. However, Ingres Corporation, which started with our open source code base four years ago in 1980, has made dramatic progress, and its code is now vastly superior to the academic version. It does not make any sense to continue to do prototyping on our software. It is a painful decision to push the code off a cliff, but at that point a new DBMS is born. So what will Postgres be?

One thing is clear: Postgres will push the envelope on data types. By now I have read a dozen papers of the form: “The relational model is great, so I tried it on [pick a vertical application]. I found it did not work, and to fix the situation, I propose we add [some new idea] to the relational model.”

Anacortes, WA: Day 1 – June 4, 1988

Some chosen verticals were geographic information systems (GISs), computer-aided design (CAD), and library information systems. It was pretty clear to me that the clean, simple relational model would turn into a complete mess if we added random functionality in this fashion. One could think of this as “death by 100 warts.”

The basic problem was the existing relational systems—specifically Ingres and System R—were designed with business data processing users in mind. After all, that was the major DBMS market at the time, and both collections of developers were trying to do better than the existing competition, namely IMS and Codasyl, on this popular use case. It never occurred to us to look at other markets, so RDBMSs were not good at them. However, a research group at the University of California at Berkeley, headed by Professor Pravin Varaiya, built a GIS on top of Ingres, and we saw firsthand how painful it was. Simulating points, lines, polygons, and line groups on top of the floats, integers, and strings in Ingres was not pretty.

It was clear to me that one had to support data types appropriate to an application and that required user-defined data types. This idea had been investigated earlier by the programming language community in systems like EL1, so all I had to do was apply it to the relational model. For example, consider the following SQL update to a salary, stored as an integer

Update Employee set (salary = salary + 1000) where name = ‘George’

To process it, one must convert the character string 1000 to an integer using the library function string-to-integer and then call the integer + routine from the C library. To support this command with a new type, say, foobar, one must merely add two functions, foobar-plus and string-to-foobar, and then call them at the appropriate times. It was straightforward to add a new DBMS command, ADDTYPE, with the name of the new data type and conversion routines back and forth to ASCII. For each desired operator on this new type, one could add the name of the operator and the code to call to apply it.
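The dispatch mechanism described here can be sketched in a few lines of Python. This is a toy illustration, not Postgres code: the routines `string-to-integer`, `integer +`, and the type `foobar` come from the text, while the registry class and its method names are invented for the sketch.

```python
# Toy sketch of the ADT machinery described in the text: ADDTYPE registers a
# type with its from-ASCII conversion routine, and each operator on that type
# registers the function the executor should call to apply it.

class TypeRegistry:
    def __init__(self):
        self.converters = {}  # type name -> from-ASCII conversion routine
        self.operators = {}   # (operator, type name) -> implementing function

    def add_type(self, name, from_ascii):
        """The ADDTYPE command: register a type and its conversion routine."""
        self.converters[name] = from_ascii

    def add_operator(self, op, type_name, func):
        """Register the code to call when op is applied to this type."""
        self.operators[(op, type_name)] = func

    def apply(self, op, type_name, left, ascii_literal):
        # Convert the ASCII literal (e.g., '1000'), just as string-to-integer
        # would, then dispatch to the registered operator implementation.
        right = self.converters[type_name](ascii_literal)
        return self.operators[(op, type_name)](left, right)

registry = TypeRegistry()
registry.add_type("integer", int)                          # string-to-integer
registry.add_operator("+", "integer", lambda a, b: a + b)  # integer +

# Executing: set salary = salary + 1000, for a current salary of 52000.
print(registry.apply("+", "integer", 52000, "1000"))  # prints 53000
```

Supporting the `foobar` type of the text would then be just two more registrations: one for `string-to-foobar` and one for `foobar-plus`.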

The devil is, of course, always in the details. One has to be able to index the new data type using B-trees or hashing.

Indexes require the notion of less-than and equality. Moreover, one needs commutativity and associativity rules to decide how the new type can be used with other types. Lastly, one must also deal with predicates of the form:

not salary < 100

This is legal SQL, and every DBMS will flip it to

salary ≥ 100

One must therefore define a negator for every operator so that this optimization is possible.
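The rewrite that a negator enables can be sketched as follows. This is an illustrative fragment, not DBMS code; the predicate representation is invented for the sketch. (A descendant of this bookkeeping survives in PostgreSQL's `CREATE OPERATOR ... NEGATOR` clause.)

```python
# Sketch of negator bookkeeping: every comparison operator records the
# operator that expresses its logical negation, letting the optimizer turn
# "not salary < 100" into the index-friendly "salary >= 100".

NEGATOR = {"<": ">=", ">=": "<", ">": "<=", "<=": ">", "=": "<>", "<>": "="}

def push_down_not(predicate):
    """Rewrite ('not', (attr, op, value)) into an un-negated predicate."""
    if predicate[0] == "not":
        attr, op, value = predicate[1]
        return (attr, NEGATOR[op], value)
    return predicate

print(push_down_not(("not", ("salary", "<", 100))))  # ('salary', '>=', 100)
```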

We had prototyped this functionality in Ingres [8], and it appeared to work, so the notion of abstract data types (ADTs) would clearly be a cornerstone of Postgres.

Winthrop, WA, Day 3. My legs are throbbing as I lay on the bed in our motel room. In fact, I am sore from the hips down but elated. We have been riding since 5 a.m. this morning; telephone pole by telephone pole we struggled uphill for 50 miles. Along the way, we rose 5,000 feet into the Cascades, putting on every piece of clothing we brought with us. Even so, we were not prepared for the snowstorm near the top of the pass. Cold, wet, and tired, we finally arrived at the top of the aptly named Rainy Pass. After a brief downhill, we climbed another 1,000 feet to the top of Washington Pass. Then it was glorious descent into Winthrop. I am now exhausted but in great spirits; there are many more passes to climb, but we are over the first two. We have proved we can do the mountains.

Berkeley, CA, 1985–1986. Chris Date wrote a pioneering paper [1] on referential integrity in 1981 in which he defined the concept and specified rules for enforcing it. Basically, if one has a table

Figure 1  The two timelines: Cross-country bike ride and Illustra/Postgres development.

Employee (name, salary, dept, age) with primary key “name”

and a second table

Dept (dname, floor) with a primary key “dname”

then the attribute dept in Employee is a foreign key; that is, it references a primary key in another table; an example of these two tables is shown in Figure 2. In this case, what happens if one deletes a department from the dept table?

For example, deleting the candy department will leave a dangling reference in the Employee table for everybody who works in the now-deleted department. Date identified six cases concerning what to do with insertions and deletions, all of which can be specified by a fairly primitive if-then rule system. Having looked at programs in Prolog and R1, I was very leery of this approach. Looking at any rule program with more than 10 statements, it is very difficult to figure out what it does. Moreover, such rules are procedural, and one can get all kinds of weird behavior depending on the order in which rules are invoked. For example, consider the following two (somewhat facetious) rules:

If Employee.name = ‘George’

Then set Employee.dept = ‘shoe’

If Employee.salary > 1000 and Employee.dept = ‘candy’

[Image]

Figure 2  Correlated data illustrating why data users need referential integrity.

Then set Employee.salary = 1000

Consider an update that moves George from the shoe department to the candy department and updates his salary to 2000. Depending on the order in which the two rules are processed, one will get different final answers. Notably, if the rules are executed in the order here, then George will ultimately have a salary of 2000; if the rule order is reversed, then his ending salary will be 1000. Having order-dependent rule semantics is pretty awful.
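
This order dependence is easy to demonstrate. The following sketch (illustrative Python, not a rule-system implementation) encodes the two facetious rules as functions and applies them to the same update in both orders:

```python
# Two toy rules, applied to a row after George is moved to candy
# with a salary of 2000. The final state depends on firing order.

def rule_move_george(emp):
    # If Employee.name = 'George' then set Employee.dept = 'shoe'
    if emp["name"] == "George":
        emp["dept"] = "shoe"

def rule_cap_candy_salary(emp):
    # If Employee.salary > 1000 and Employee.dept = 'candy'
    # then set Employee.salary = 1000
    if emp["salary"] > 1000 and emp["dept"] == "candy":
        emp["salary"] = 1000

def apply_rules(emp, rules):
    for rule in rules:
        rule(emp)
    return emp

update = {"name": "George", "dept": "candy", "salary": 2000}

# Order as written in the text: George is moved to shoe first,
# so the salary cap never fires and he keeps 2000.
a = apply_rules(dict(update), [rule_move_george, rule_cap_candy_salary])

# Reversed order: the cap fires while he is still in candy,
# so he ends up with 1000.
b = apply_rules(dict(update), [rule_cap_candy_salary, rule_move_george])
```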

A fundamental tenet of the relational model is that the order of evaluation of a query, including the order in which records are accessed, is up to the system. Hence, a query should always give the same final result, regardless of the query plan chosen for execution. As one can imagine, it is trivial to construct collections of rules that give different answers for different query plans—obviously undesirable system behavior.

I spent many hours over a couple of years looking for something else. Ultimately, my preferred approach was to add a keyword always to the query language. Hence, any utterance in the query language should have the semantics that it appears to be continually running. For example, if Mike must have the same salary as Sam, then the following always command will do the trick

Always update Employee, E

set salary = E.salary

where Employee.name = ‘Mike’ and E.name = ‘Sam’

Whenever Mike receives a salary adjustment, this command will kick in and reset his salary to that of Sam. Whenever Sam gets a raise, it will be propagated to Mike. Postgres would have this always command and avoid (some of) the ugliness of an if-then rules system. This was great news; Postgres would try something different that had the possibility of working.
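
The intended semantics—a command that appears to be continually running—can be sketched as follows. This is a toy model in Python (the class and method names are my own, not Illustra/Postgres internals): an always command is registered against a table and re-applied after every update, so the constraint never appears violated:

```python
# A toy model of "always" semantics: registered commands are
# re-run after every update, so they appear to run continually.

class Table:
    def __init__(self, rows):
        self.rows = rows          # list of dicts
        self.always = []          # registered always commands

    def register_always(self, command):
        self.always.append(command)
        command(self.rows)        # enforce immediately on registration

    def update(self, name, **changes):
        for row in self.rows:
            if row["name"] == name:
                row.update(changes)
        for command in self.always:
            command(self.rows)    # re-enforce after each update

def mike_matches_sam(rows):
    # Always update Employee set salary = E.salary
    # where Employee.name = 'Mike' and E.name = 'Sam'
    sam = next(r for r in rows if r["name"] == "Sam")
    mike = next(r for r in rows if r["name"] == "Mike")
    mike["salary"] = sam["salary"]

employee = Table([{"name": "Sam", "salary": 500},
                  {"name": "Mike", "salary": 300}])
employee.register_always(mike_matches_sam)

employee.update("Mike", salary=900)   # kicks in: reset to Sam's salary
employee.update("Sam", salary=700)    # propagates to Mike
```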

[Image]

Marias Pass, MT: Day 15

Marias Pass, MT, Day 15. I cannot believe it. We round a corner and see the sign for the top of the pass. We are at the Continental Divide! The endless climbs in the Cascades and the Rockies are behind us, and we can see the Great Plains stretching out in front of us. It is now downhill to Chicago! To celebrate this milestone, we pour a small vial of Pacific Ocean water we have been carrying since Anacortes to the east side of the pass where it will ultimately flow into the Gulf of Mexico.

Berkeley, CA, 1986. My experience with Ingres convinced me a database log for recovery purposes is tedious and difficult to code. In fact, the gold standard specification is in C. Mohan et al. [3]. Moreover, a DBMS is really two DBMSs, one managing the database as we know it and a second one managing the log, as in Figure 3. The log is the actual system of record, since the contents of the DBMS can be lost. The idea we explored in Postgres was to support time travel. Instead of updating a data record in place and then writing both the new contents and the old contents into the log, could we leave the old record alone and write a second record with the new contents in the actual database? That way the log would be incorporated into the normal database and no separate log processing would be required, as in Figure 4. A side benefit of this architecture is the ability to support time travel, since old records are readily queryable in the database. Lastly, standard accounting systems use no overwrite in their approach to record keeping, so Postgres would be compatible with this tactic.

[Image]

Figure 3  Traditional DBMS crash recovery.

[Image]

Figure 4  Postgres picture: No overwrite.
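
The no-overwrite idea can be sketched concretely. In this simplified model (the field names and logical clock are assumptions for illustration, not Postgres's actual format), an update never modifies a record in place: the old version is closed out with an end-timestamp and a new version is appended, so old states remain queryable:

```python
# A simplified sketch of no-overwrite storage with time travel.
# The table is append-only: each entry is [tmin, tmax, record],
# where tmax = infinity means "current version."

INFINITY = float("inf")

class NoOverwriteTable:
    def __init__(self):
        self.versions = []   # append-only list of [tmin, tmax, record]
        self.clock = 0       # logical timestamp

    def insert(self, record):
        self.clock += 1
        self.versions.append([self.clock, INFINITY, dict(record)])

    def update(self, key, value, **changes):
        self.clock += 1
        for v in self.versions:
            if v[2].get(key) == value and v[1] == INFINITY:
                v[1] = self.clock                        # close old version
                new = {**v[2], **changes}                # write new version
                self.versions.append([self.clock, INFINITY, new])
                return

    def as_of(self, t):
        # Time travel: the table's state as of logical time t.
        return [v[2] for v in self.versions if v[0] <= t < v[1]]

emp = NoOverwriteTable()
emp.insert({"name": "George", "dept": "candy"})      # t = 1
emp.update("name", "George", dept="shoe")            # t = 2
```

Because the old version is never destroyed, the "log" is just the table itself, and a historical query is an ordinary scan with a timestamp predicate.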

At a high level, Postgres would make contributions in three areas: an ADT system, a clean rules system based on the always command, and a time-travel storage system. Much of this functionality is described in Stonebraker and Rowe [6,7]. For more information on the scope of Postgres, one can consult the video recording of the colloquium celebrating my 70th birthday [2]. We were off and running with an interesting technical plan.

First Speedbumps

Drake, ND, Day 26. We are really depressed. North Dakota is bleak. The last few days have been the following monotony:

[Image]

Drake, ND: Day 26

See the grain elevator ahead that signifies the next town

Ride for an hour toward the elevator

Pass through the town in a few minutes

See the next grain elevator …

However, it is not the absence of trees (we joke the state tree of North Dakota is the telephone pole) and the bleak landscape that is killing us. Normally, one can simply sit up straight in the saddle and be blown across the state by the prevailing winds, which are typically howling from west to east. They are howling all right, but the weather this summer is atypical. We are experiencing gale-force winds blowing east to west, straight in our faces. While we are expecting to be blown along at 17–18 miles per hour, we are struggling hard to make 7. We made only 51 miles today and are exhausted. Our destination was Harvey, still 25 miles away, and we are not going to make it. More ominously, the tree line (and Minnesota border) is still 250 miles away, and we are not sure how we will get there. It is all we can do to refuse a ride from a driver in a pickup truck offering to transport us down the road to the next town.

The food is also becoming problematic. Breakfast is dependable. We find a town, then look for the café (often the only one) with the most pickup trucks. We eat from the standard menu found in all such restaurants. However, dinner is getting really boring. There is a standard menu of fried fare; we yearn for pasta and salad, but it is never on the menu.

We have established a routine. It is in the 80s or 90s Fahrenheit every day, so Beth and I get on the road by 5 a.m. Mary Anne and Leslie get up much later; they hang around the motel, then pass us on the road going on to the town where we will spend the night. When we arrive at the new motel, one of us relieves Mary Anne while the other tries to find someplace with food we are willing to eat. Although we have camping equipment with us, the thought of an air mattress after a hard day on the road is not appealing. In fact, we never camp. Leslie has happily adapted to this routine, and one of her favorite words, at 18 months old, is “ice machine.” Our goal is 80 miles a day in the flats and 60 miles a day in the mountains. We ride six days per week.

Berkeley, CA, 1986. I had a conversation with an Ingres customer shortly after he implemented date and time as a new data type (according to the American National Standards Institute specification). He said, “You implemented this new data type incorrectly.” In effect, he wanted a different notion of time than what was supported by the standard Gregorian calendar. More precisely, he calculated interest on Wall Street-type financial bonds, which give the owner the same amount of interest, regardless of how long a month is. That is, he wanted a notion of bond time in which March 15 minus February 15 is always 30 days, and each year is divided into 30-day months. Operationally, he merely wanted to overload temporal subtraction with his own notion. This was impossible in Ingres, of course, but easy to do in Postgres. It was a validation that our ADTs were a good idea.
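
The overloaded subtraction is simple to state. The sketch below (illustrative Python; real 30/360 day-count conventions add end-of-month adjustments this version omits) treats every month as 30 days and contrasts the result with ordinary calendar subtraction:

```python
# "Bond time" subtraction: every month counts as 30 days, every
# year as 360, so March 15 minus February 15 is always 30 days.

from datetime import date

def bond_days_between(d1: date, d2: date) -> int:
    # Overloaded temporal subtraction under a simplified 30/360 rule.
    return ((d2.year - d1.year) * 360
            + (d2.month - d1.month) * 30
            + (d2.day - d1.day))

# Bond time: always 30 days between the 15ths of adjacent months.
bond = bond_days_between(date(1986, 2, 15), date(1986, 3, 15))

# Gregorian time: February 1986 has 28 days, so the answer differs.
gregorian = (date(1986, 3, 15) - date(1986, 2, 15)).days
```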

Berkeley, CA, 1986. My partner, the “Wine Connoisseur,” and I have had a running discussion for nearly a year about the Postgres data model. Consider the Employee-Dept database noted earlier. An obvious query is to join the two tables, to, say, find the names and floor numbers of employees, as noted in this SQL command:

Select E.name, D.floor

From Employee E, Dept D

Where E.dept = D.dname

In a programming language, this task would be coded procedurally as something like (see code section 1).
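
Such a hand-coded version might look like the following nested-loop sketch (illustrative Python; the chapter's original Code Section 1 listing is not reproduced here):

```python
# A hand-coded, procedural join: scan Employee, and for each row
# scan Dept for the matching department name.

employee = [{"name": "George", "dept": "shoe"},
            {"name": "Sam",    "dept": "candy"}]
dept = [{"dname": "shoe",  "floor": 1},
        {"dname": "candy", "floor": 2}]

result = []
for e in employee:                  # outer scan of Employee
    for d in dept:                  # inner scan of Dept
        if e["dept"] == d["dname"]:
            result.append((e["name"], d["floor"]))
```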

A programmer codes an algorithm to find the desired result. In contrast, one tenet of the relational model is programmers should state what they want without having to code a search algorithm. That job falls to the query optimizer, which must decide (at scale) whether to iterate over Employee first or over Dept or to hash both tables on the join key or sort both tables for a merge or …
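
One of those optimizer alternatives, hashing both tables on the join key, can be sketched by hand (illustrative Python; here the hash table is built on Dept and probed with Employee, replacing the quadratic nested loop with two linear passes):

```python
# A hash join: build a hash table on Dept's key, then probe it
# with one pass over Employee.

employee = [{"name": "George", "dept": "shoe"},
            {"name": "Sam",    "dept": "candy"}]
dept = [{"dname": "shoe",  "floor": 1},
        {"dname": "candy", "floor": 2}]

# Build phase: hash Dept on its primary key.
by_dname = {d["dname"]: d for d in dept}

# Probe phase: a single pass over Employee.
result = [(e["name"], by_dname[e["dept"]]["floor"])
          for e in employee if e["dept"] in by_dname]
```

The point of the relational model is that the programmer writes neither version; the optimizer picks between plans like these based on table sizes and available indexes.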

My Ingres experience convinced me optimizers are really difficult, and the brain surgeon in any database company is almost certainly the optimizer specialist. Now we were considering extending the relational model to support more complex types. In its most general form, we could consider a column whose fields were pointers to arrays of structures of … I could not wrap my brain around designing a query optimizer for something this complex. On the other hand, what should we discard? In the end, the Wine Connoisseur and I were depressed as we chose a design point with rudimentary complex objects. There was still a lot of code to support the notion we selected.

Berkeley, CA, 1987. The design of time travel in Postgres is in Stonebraker [5]. Although this is an elegant construct in theory, making it perform well in practice is tricky. The basic problem is the two databases in the traditional architecture of Figure 3 are optimized very differently. The data is “read-optimized” so queries are fast, while the log is “write-optimized” so one can commit transactions rapidly. Postgres must try to accomplish both objectives in a single store; for example, if 10 records are updated in a transaction, then Postgres must force to disk all the pages on which these records occurred at commit time. Otherwise, the DBMS can develop “amnesia,” a complete no-no. A traditional log will group all the log records on a small collection of pages, while the data records remain read-optimized. Since we are combining both constructs into one storage structure, we have to address a tricky record placement problem to try to achieve both objectives, and our initial implementation is not very good. We spend a lot of time trying to fix this subsystem.

Berkeley, CA, 1987. The Wine Connoisseur and I had written Ingres in C and did not want to use it again. That sounded too much like déjà vu. However, C++ was not mature enough, and other language processors did not run on Unix. By this time, any thought of changing operating systems away from Unix was not an option; all the Berkeley students were being trained on Unix, and it was quickly becoming the universal academic operating system. So we elected to drink the artificial intelligence Kool-Aid and started writing Postgres in Lisp.

Code Section 1.

[Image]

Once we had a rudimentary version of Postgres running, we saw what a disastrous performance mistake this was—at least one-order-of-magnitude performance penalty on absolutely everything. We immediately tossed portions of the code base off the cliff and converted everything else to C. We were back to déjà vu (coding in C), having lost a bunch of time, but at least we had learned an important lesson: Do not jump into unknown water without dipping your toe in first. This was the first of several major code rewrites.

Berkeley, CA, 1988. Unfortunately, I could not figure out a way to make our always command general enough to at least cover Chris Date’s six referential integrity cases. After months of trying, I gave up, and we decided to return to a more conventional rule system. More code over the cliff, and more new functionality to write.

In summary, for several years we struggled to make good on the original Postgres ideas. I remember this time as a long “slog through the swamp.”

Another High

Carrington, ND, the next afternoon. It is really hot, and I am dead tired. I am on “Leslie duty,” and after walking though town, we are encamped in the ubiquitous (and air-conditioned) local Dairy Queen. I am watching Leslie slurp down a soft serve, feeling like “God is on our side,” as serendipity has intervened in a big way today. No, the wind is still blowing at gale force from east to west. Serendipity came in the form of my brother. He has come from Maine to ride with us for a week. Mary Anne picked him and his bicycle up at the Minot airport yesterday afternoon. He is fresh and a very, very strong rider. He offers to break the wind for us, like you see in bicycle races. With some on-the-job training (and a couple of excursions into the wheat fields when we hit his rear wheel), Beth and I figure out how to ride six inches behind his rear wheel. With us trying to stay synchronized with a faster-slower-faster dialog, we rode 79 miles today. It is now clear we are “over the hump” and will get out of North Dakota, a few inches behind my brother’s wheel, if necessary.

Battle Lake, MN, July 4, 1988, Day 30. We are resting today and attending the annual 4th of July parade in this small town. It is quite an experience—the local band, clowns giving out candy, which Leslie happily takes, and Shriners in their little cars. It is a slice of Americana I will never forget. Rural America has taken very good care of us, whether by giving our bike a wide berth when passing, willingly cashing our traveler’s checks, or alerting us to road hazards and detours.

Berkeley, CA, 1992. In my experience, the only way to really make a difference in the DBMS arena is to get your ideas into the commercial marketplace. In theory, one could approach the DBMS companies and try to convince them to adopt something new. In fact, there was an obvious “friendly” one—Ingres Corporation—although it had its own priorities at the time.

I have rarely seen technology transfer happen in this fashion. There is a wonderful book by Harvard Business School professor Clayton Christensen called The Innovator’s Dilemma. His thesis is that technology disruptions are very challenging for the incumbents. Specifically, it is very difficult for established vendors with old technology to morph to a new approach without losing their customer base. Hence, disruptive ideas do not usually find a receptive audience among the established vendors, and launching a startup to prove one’s ideas is the preferred option.

By mid-1992 I had ended my association with Ingres and a sufficient amount of time had passed that I was free of my non-compete agreement with the company. I was ready to start a commercial Postgres company and contacted my friend the “Tall Shark.” He readily agreed to be involved. What followed was a somewhat torturous negotiation of terms with the “Head Land Shark,” with me getting on-the-job training in the terms and conditions of a financing contract. Finally, I understood what I was being asked to sign. It was a difficult time, and I changed my mind more than once. In the end, we had a deal, and Postgres had $1 million in venture capital to get going.

Right away two stars from the academic Ingres team—“Quiet” and “EMP1”—moved over to help. They were joined shortly thereafter by “Triple Rock,” and we had a core implementation team. I also reached out to “Mom” and her husband, the “Short One,” who also jumped on board, and we were off and running, with the Tall Shark acting as interim CEO. Our initial jobs were to whip the research code line into commercial shape, convert the query language from QUEL to SQL, write documentation, fix bugs, and clean up the “cruft” all over the system.

Emeryville, CA, 1993. After a couple of naming gaffes, we chose Illustra, and our goal was to find customers willing to use (and hopefully pay for) a system from a startup. We had to find a compelling vertical market, and the one we chose to focus on was geographic data. Triple Rock wrote a collection of abstract data types for points, lines, and polygons with the appropriate functions (such as distance from a point to a line).
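
The flavor of such an ADT can be sketched as follows (illustrative Python rather than Illustra's extension language; the type and function names are my own): geometric values become first-class types, and functions such as point-to-segment distance become registered operations the query language can call:

```python
# A sketch of a geographic ADT: Point and Segment types plus a
# distance function, of the kind registered with the DBMS so that
# queries can filter and rank on it.

import math
from dataclasses import dataclass

@dataclass
class Point:
    x: float
    y: float

@dataclass
class Segment:
    a: Point
    b: Point

def distance(p: Point, s: Segment) -> float:
    # Project p onto the segment, clamping to its endpoints.
    dx, dy = s.b.x - s.a.x, s.b.y - s.a.y
    length_sq = dx * dx + dy * dy
    if length_sq == 0:                        # degenerate segment
        return math.hypot(p.x - s.a.x, p.y - s.a.y)
    t = ((p.x - s.a.x) * dx + (p.y - s.a.y) * dy) / length_sq
    t = max(0.0, min(1.0, t))
    cx, cy = s.a.x + t * dx, s.a.y + t * dy   # closest point on segment
    return math.hypot(p.x - cx, p.y - cy)
```

With the function registered inside the DBMS, a query such as "all roads within a mile of this point" runs next to the data instead of shipping every row to the application.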

After an infusion of capital from new investors, including the “Entrepreneur-Turned-Shark,” we again ran out of money, prompting the phone call from Kennebago noted earlier. Soon thereafter, we were fortunate to be able to hire the “Voice-of-Experience” as the real CEO, and he recruited “Smooth” to be VP of sales, complementing “Uptone,” who was previously hired to run marketing. We had a real company with a well-functioning engineering team and world-class executives. The future was looking up.

Ludington, MI, Day 38. We walk Boston Bound off the Lake Michigan ferry and start riding southeast. The endless Upper Midwest is behind us; it is now less than 1,000 miles to Boston! Somehow it is reassuring that we have no more water to cross. We are feeling good. It is beginning to look like we might make it.

The High Does Not Last

Ellicottville, NY, Day 49. Today was a very bad day. Our first problem occurred while I was walking down the stairs of the hotel in Corry, PA, in my bicycle cleats. I slipped on the marble floor and wrenched my knee. Today, we had only three good legs pushing Boston Bound along. However, the bigger problem is we hit the Allegheny Mountains. Wisconsin, Michigan, and Ohio are flat. That easy riding is over, and our bicycle maps are sending us up and then down the same 500 feet over and over again. Also, road planners around here do not seem to believe in switchbacks; we shift into the lowest of our 21 gears to get up some of these hills, and it is exhausting work. We are not, as you can imagine, in a good mood. While Beth is putting Leslie to bed, I ask the innkeeper in Ellicottville a simple question, “How do we get to Albany, NY, without climbing all these hills?”

Emeryville, CA, 1993. Out of nowhere comes our first marketing challenge. It was clear our “sweet spot” was any application that could be accelerated through ADTs. We would have an unfair advantage over any other DBMS whenever this was true. However, we faced a Catch-22 situation. After a few “lighthouse” customers, the more cautious ones clearly said they wanted GIS functionality from the major GIS vendors (such as ArcInfo and MapInfo). We needed to recruit application companies in specific vertical markets and convince them to restructure the inner core of their software into ADTs—not a trivial task. The application vendors naturally said, “Help me understand why we should engage with you in this joint project.” Put more bluntly, “How many customers do you have and how much money can I expect to make from this additional distribution channel for my product?” That is, we viewed this rearchitecting as a game-changing technology shift any reasonable application vendor should embrace. However, application vendors viewed it as merely a new distribution channel. This brought up the Catch-22: Without ADTs we could not get customers, and without customers we could not get ADTs. We were pondering this depressing situation, trying to figure out what to do, when the next crisis occurred.

Oakland, CA, 1994. We were again out of money, and the Land Sharks announced we were not making good progress toward our company goals. Put more starkly, they would put up additional capital, but only at a price lower than the previous financing round. We were facing the dreaded “down round.” After the initial (often painful) negotiation, when ownership is a zero-sum game between the company team and the Land Sharks, the investors and the team are usually on the same side of the table. The goal is to build a successful company, raising money when necessary at increasing stock prices. The only disagreement concerns the “spend.” The investors naturally want you to spend more to make faster progress, since that would ensure them an increasing percentage ownership of the company. In contrast, the team wants to “kiss every nickel” to minimize the amount of capital raised and maximize their ownership. Resolving these differences is usually pretty straightforward. When a new round of capital is needed, a new investor is typically brought in to set the price of the round. It is in the team’s interest to make this as high as possible. The current investors will be asked to support the round, by adding their pro-rata share at whatever price is agreed on.

However, what happens if the current investors refuse to support a new round at a higher price? Naturally, a new investor will follow the lead of the current ones, and a new lower price is established. At this point, there is a clause in most financing agreements that the company must ex post facto reprice the previous financing round (or rounds) down to the new price. As you can imagine, a down round is incredibly dilutive financially to the team, who would naturally say, “If you want us to continue, you need to top up our options.” As such, the discussion becomes a three-way negotiation among the existing investors, the new investors, and the team. It is another painful zero-sum game.

When the dust settled, the Illustra employees were largely made whole through new options, the percentage ownership among the Land Sharks had changed only slightly, and the whole process left a bitter taste. Moreover, management had been distracted for a couple of months. The Land Sharks seemed to be playing some sort of weird power game with each other I did not understand. Regardless, Illustra will live to fight another day.

The Future Looks Up (Again)

Troy, NY, Day 56. The innkeeper in Ellicottville tells us what was obvious to anybody in the 19th century moving goods between the eastern seaboard and the middle of the country. He said, “Ride north to the Erie Canal and hang a right.” After a pleasant (and flat) ride down the Mohawk Valley, we arrive at Troy and see our first road sign for Boston, now just 186 miles away. The end is three days off! I am reminded of a painted sign at the bottom of Wildcat Canyon Road in Orinda, CA, at the start of the hill that leads back to Berkeley from the East Bay. It says simply “The Last Hill.” We are now at our last hill. We need only climb the Berkshires to Pittsfield, MA. It is then easy riding to Boston.

Oakland, CA, 1995. Shortly after our down round and the Catch-22 on ADTs, serendipity occurred once more. The Internet was taking off, and most enterprises were trying to figure out what to do with it. Uptone executed a brilliant repositioning of Illustra. We became the “database for cyberspace,” capable of storing Internet data like text and images. He additionally received unbelievable airtime by volunteering Illustra to be the database for “24 Hours in Cyberspace,” a worldwide effort by photojournalists to create one Web page per hour, garnering a lot of positive publicity. Suddenly, Illustra was “the new thing,” and we were basking in reflected glory. Sales picked up and the future looked bright. The Voice-of-Experience stepped on the gas and we hired new people. Maybe this was the beginning of the widely envied “hockey stick of growth.” We were asked to do a pilot application for a very large Web vendor, a potentially company-making transaction. However, we were also in a bake-off with the traditional RDBMSs.

The Good Times Do Not Last Long

Oakland, CA, 1995. Reality soon rears its ugly head. Instead of doing a benchmark on a task we were good at (such as geographic search or integrating text with structured data and images), the Web vendor decided to compare us on a traditional bread-and-butter transaction-processing use case, in which the goal is to perform as many transactions per second as you can on a standard banking application. It justified its choice by saying, “Within every Internet application, there is a business data-processing sub-piece that accompanies the multimedia requirements, so we are going to test that first.”

There was immediately a pit in my stomach because Postgres was never engineered to excel at online transaction processing (OLTP). We were focused on ADTs, rules, and time travel, not on trying to compete with current RDBMSs on the turf for which they had been optimized. Although we were happy to do transactions, it was far outside our wheelhouse. Our performance was going to be an order of magnitude worse than what was offered by the traditional vendors we were competing against. The problem was a collection of architectural decisions I had made nearly a decade earlier that were not easy to undo; for example, Illustra ran as an operating system process for each user. This architecture was well understood to be simple to implement but suffers badly on a highly concurrent workload with many users doing simple things. Moreover, we did not compile query plans aggressively, so our overhead to do simple things was high. When presented with complex queries or use cases where our ADTs were advantageous, these shortcomings were not an issue. But when running simple business data processing, we were going to lose, and lose badly.

我们陷入了严峻的现实:我们必须大幅提高事务处理性能,而这既不简单也不快速。我和 Short One 花了几个小时,试图找到一种无需大量重新编码、精力、成本和延迟就能实现这一目标的方法。我们一无所获。Illustra 必须进行成本高昂的重新架构。

We were stuck with the stark reality that we had to dramatically improve transaction-processing performance, which would be neither simple nor quick. I spent hours with the Short One trying to find a way to make it happen without a huge amount of recoding, energy, cost, and delay. We drew a blank. Illustra would have to undergo a costly rearchitecting.

故事结束

The Stories End

马萨诸塞州萨顿,第 59 天。马萨诸塞州的道路标记很差,我们从未见过如此无礼的司机。在这里骑行并不愉快,我们无法想象骑着 Boston Bound 进入波士顿市中心,更不用说找到可以下海的地方了。我们决定改在马萨诸塞州昆西的沃拉斯顿海滩结束行程,该海滩位于波士顿以南约 10 英里处。在例行公事地拖着自行车穿过海滩并将前轮浸入海浪中之后,我们就完成了。我们在海滨咖啡馆喝了一杯香槟,思考接下来会发生什么。

Sutton, MA, Day 59. Massachusetts roads are poorly marked, and we have never seen more discourteous drivers. Riding here is not pleasant, and we cannot imagine trying to navigate Boston Bound into downtown Boston, let alone find someplace where we can access the ocean. We settle instead for finishing at Wollaston Beach in Quincy, MA, approximately 10 miles south of Boston. After the perfunctory dragging of our bike across the beach and dipping the front wheel in the surf, we are done. We drink a glass of champagne at a beachside café and ponder what happens next.

图像

马萨诸塞州沃拉斯顿海滩:第 59 天

Wollaston Beach, MA: Day 59

加利福尼亚州奥克兰,1996 年 2 月。机缘巧合再次出现。我们在 Web 供应商基准测试中与之竞争的供应商之一已受到该基准测试的严重威胁。它看到 Illustra 将轻松赢得各种 Internet 风格的基准测试,并且 Web 供应商在该领域将有大量要求。结果,它选择收购 Illustra。从很多方面来说,这就是我们所有问题的答案。该公司拥有一个高性能 OLTP 平台,我们可以在其中插入 Illustra 功能。它也是一家大公司,拥有足够的“影响力”来让应用程序供应商将 ADT 添加到其系统中。我们完成了我们认为是互惠互利的交易,并开始将 Illustra 功能放入其引擎中。

Oakland, CA, February 1996. Serendipity occurs yet again. One of the vendors we had competed against on the Web vendor’s benchmark had been seriously threatened by that benchmark. It saw that Illustra would win a variety of Internet-style benchmarks hands-down and that Web vendors would have substantial requirements in this area. As a result, it elected to buy Illustra. In many ways, this was the answer to all our issues. The company had a high-performance OLTP platform into which we could insert the Illustra features. It was also a big company with sufficient “throw-weight” to get application vendors to add ADTs to its system. We consummated what we thought was a mutually beneficial transaction and set to work putting Illustra features into its engine.

我将在这里结束 Illustra 的故事,尽管还有很多东西要讲,其中大部分都相当黑暗——股东诉讼、多名新任首席执行官,以及最终公司的出售。显而易见的要点是,在选择您同意联姻的公司时要非常小心。

I will end the Illustra story here, even though there is much more to tell, most of it fairly dark—a shareholder lawsuit, multiple new CEOs, and ultimately a sale of the company. The obvious takeaway is to be very careful about the choice of company you agree to marry.

为什么要讲自行车故事?

Why a Bicycle Story?

你可能想知道为什么我会讲这个骑自行车的故事。有以下三个原因。首先,我想为您提供一个成功穿越美国的算法。

You might wonder why I would tell this bicycling story. There are three reasons. First, I want to give you an algorithm for successfully riding across America.

图像

显然,遵循这个算法将会成功。如果发生的话,撒上一些意外的惊喜。现在稍微抽象一下,用“目标”代替“海洋”,用“适当行动”代替“向东行驶”

It is clear that following this algorithm will succeed. Sprinkle in some serendipity if it occurs. Now abstract it a bit by substituting “goal” for “Ocean” and “appropriate action” for “Ride east.”

图像

由于我将再次使用该算法,因此我将其设为宏

Since I will be using this algorithm again, I will make it a macro:

图像
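The algorithm and its macro survive here only as image placeholders, so the following is a hedged reconstruction from the surrounding text; the function and parameter names are my own invention, not the article’s.

```python
# A sketch reconstructed from the prose above; the original appeared only as
# figures ("Until (Ocean) { Ride east }"), so all names here are invented.

def ride_across_america(ocean_reached, ride_east, deal_with_adversity):
    """Until the ocean: ride east, dealing with adversity (and serendipity)."""
    while not ocean_reached():
        ride_east()
        deal_with_adversity()

# Substituting "goal" for "Ocean" and "appropriate action" for "Ride east"
# yields the reusable make-it-happen macro the text refers to.
def make_it_happen(goal_reached, appropriate_action, deal_with_adversity):
    while not goal_reached():
        appropriate_action()
        deal_with_adversity()
```

The abstraction is the point: the same loop describes a bicycle trip, a Ph.D., tenure, or a startup, differing only in the goal test and the action taken each iteration.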

通过这个序言,我可以给出我 1988 年左右的简历的缩略图。

With this preamble, I can give a thumbnail sketch of my résumé, circa 1988.

图像

根据我的经验,获得博士学位(大约五年)是该算法发挥作用的一个例子。其中有高潮(通过预考)、低谷(第一次资格考试不及格),还有很多沼泽中的跋涉(写一篇我的委员会可以接受的论文)。获得终身教职(又是五年)是这种算法发挥作用的一个更不令人愉快的例子。

In my experience, getting a Ph.D. (approximately five years) is an example of this algorithm at work. There are ups (passing prelims), downs (failing quals the first time), and a lot of slog through the swamp (writing a thesis acceptable to my committee). Getting tenure (another five years) is an even less pleasant example of this algorithm at work.

这就引出了提出该算法的第二个原因。显而易见的问题是:“为什么有人想要进行这次自行车旅行?”旅途漫长而艰难,有沮丧、兴高采烈和无聊的时期,还有无处不在的劣质食物。我只能说:“这听起来是个好主意,我会毫不犹豫地再去一次。”就像博士学位和终身教职一样,这是 make-it-happen 实际运作的一个例子。显而易见的结论是,我天生就会寻找 make-it-happen 的机会,并从中获得巨大的满足感。

This introduces the second reason for presenting the algorithm. The obvious question is, “Why would anybody want to do this bicycle trip?” It is long and very difficult, with periods of depression, elation, and boredom, along with the omnipresence of poor food. All I can say is, “It sounded like a good idea, and I would go again in a heartbeat.” Like a Ph.D. and tenure, it is an example of make-it-happen in action. The obvious conclusion to draw is I am programmed to search out make-it-happen opportunities and get great satisfaction from doing so.

我想在这里过渡到讲述自行车故事的第三个原因。骑行穿越美国是构建系统软件的一个方便的比喻。让我首先写下构建新 DBMS 的算法(参见代码第 2 部分)。

I want to transition here to the third reason for telling the bicycle story. Riding across America is a handy metaphor for building system software. Let me start by writing down the algorithm for building a new DBMS (see code section 2).

下一个问题是:“我如何想出一个新想法?”答案是:“我不知道。”然而,这并不能阻止我发表一些评论。就我个人的经验而言,我从来没有通过跑到山顶上苦思冥想而想出任何东西。相反,我的想法来自两个来源。第一个来源是与有实际问题的真实用户交谈,然后尝试解决这些问题;这可以确保我提出的是有人真正关心的想法,脚踏实地而不是好高骛远。第二个来源是把可能好(或坏)的想法抛给那些会挑战它们的同事。总之,产生好想法的最佳机会是花时间在现实世界中,并找到一个能在智力上挑战你的环境(如 MIT/CSAIL 和 Berkeley/EECS)。

The next question is, “How do I come up with a new idea?” The answer is, “I don’t know.” However, that will not stop me from making a few comments. From personal experience, I never come up with anything by going off to a mountaintop to think. Instead, my ideas come from two sources. The first is talking to real users with real problems and then trying to solve them; this ensures I come up with ideas that somebody cares about, where the rubber meets the road and not the sky. The second is bouncing possibly good (or bad) ideas off colleagues who will challenge them. In summary, the best chance for generating a good idea is to spend time in the real world and find an environment (like MIT/CSAIL and Berkeley/EECS) where you will be intellectually challenged.

代码第 2 节。

Code Section 2.

图像

代码第 3 节。

Code Section 3.

图像

如果您的想法成立并且您有一个工作原型,那么您可以继续进入第二阶段,该阶段具有现在熟悉的外观(请参阅代码第 3 节)。

If your ideas hold water and you have a working prototype, then you can proceed to phase two, which has a by-now-familiar look (see code section 3).
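Because code sections 2 and 3 survive here only as image placeholders, a hedged sketch of the two phases, reusing the make-it-happen macro described in the text, might look like the following; all names are mine, not the article’s.

```python
# Hedged reconstruction of code sections 2 and 3, which appear above only
# as image placeholders; the structure follows the prose, the names are mine.

def make_it_happen(goal_reached, appropriate_action):
    """The macro from the text: keep taking the appropriate action until done."""
    while not goal_reached():
        appropriate_action()

def build_new_dbms(have_good_idea, prototype_works, improve_prototype,
                   product_is_bulletproof, improve_product):
    # Phase one (code section 2): turn a good idea into a working prototype.
    have_good_idea()
    make_it_happen(prototype_works, improve_prototype)
    # Phase two (code section 3): commercialize, i.e., make-it-happen again
    # until the software really works and never loses data.
    make_it_happen(product_is_bulletproof, improve_product)
```

Each phase is just another invocation of the macro, which is exactly the point the chapter makes: the same loop runs for a decade or so per phase.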

与其他系统软件一样,构建一个新的 DBMS 很困难,需要十年左右的时间,并且会经历兴高采烈和沮丧的时期。与骑自行车穿越美国只需要体力和毅力不同,构建新的 DBMS 还涉及其他挑战。在原型阶段,必须找出新的接口,包括内部接口和应用程序接口,以及操作系统、网络和持久存储的接口。根据我的经验,第一次就把它们做好是不寻常的。不幸的是,人们通常必须首先构建它才能了解应该如何构建它。您将不得不扔掉代码并重新开始,也许要多次。此外,一切都会影响其他一切。在巨大的设计空间中无情地避免复杂性是一项艰巨的工程挑战。让软件变得快速且可扩展只会让事情变得更加困难。这很像骑车穿越美国。

As with other system software, building a new DBMS is difficult, takes a decade or so, and involves periods of elation and depression. Unlike bicycling across America, which takes just muscles and perseverance, building a new DBMS involves other challenges. In the prototype phase, one must figure out new interfaces, both internal and to applications, as well as to the operating system, networking, and persistent storage. In my experience, getting them right the first time is unusual. Unfortunately, one must often build it first to see how one should have built it. You will have to throw code away and start again, perhaps multiple times. Furthermore, everything influences everything else. Ruthlessly avoiding complexity while navigating a huge design space is a supreme engineering challenge. Making the software fast and scalable just makes things more difficult. It is a lot like riding across America.

商业化也带来了一系列挑战。该软件必须真正工作,生成正确的答案,永不崩溃,并成功处理所有极端情况,包括耗尽任何计算机资源(例如主内存和磁盘)。此外,客户依靠 DBMS 来保证数据不会丢失,因此事务管理必须是万无一失的。这比看起来更困难,因为 DBMS 是多用户软件。可重复的错误,或“Bohrbugs”,很容易从系统中剔除,留下杀手,不可重复的错误,或“Heisenbugs”。试图找到不可重复的错误是一种令人沮丧的练习。更糟糕的是,Heisenbug 通常存在于交易系统中,导致客户丢失数据。这个现实多次让我感到心痛不已。生产(和测试)系统软件需要很长时间并且花费大量资金。能够做到这一点的系统程序员有我的钦佩。总之,构建和商业化新的 DBMS 的特点是

Commercialization adds its own set of challenges. The software must really work, generating the right answer, never crashing, and dealing successfully with all the corner cases, including running out of any computer resource (such as main memory and disk). Moreover, customers depend on a DBMS to never lose their data, so transaction management must be bulletproof. This is more difficult than it looks, since DBMSs are multi-user software. Repeatable bugs, or “Bohrbugs,” are easy to knock out of a system, leaving the killers, nonrepeatable errors, or “Heisenbugs.” Trying to find nonrepeatable bugs is an exercise in frustration. To make matters worse, Heisenbugs are usually in the transaction system, causing customers to lose data. This reality has generated a severe pit in my stomach on several occasions. Producing (and testing) system software takes a long time and costs a lot of money. The system programmers who are able to do this have my admiration. In summary, building and commercializing a new DBMS can be characterized by

Have a good idea (or two or three);

Make-it-happen--for a decade or so;

Have a good idea (or two or three);

Make-it-happen--for a decade or so;

这就提出了一个明显的问题:“为什么有人想做这么困难的事情?” 答案与获得博士学位、获得终身教职或骑车穿越美国是一样的。我倾向于接受这样的挑战。我花了十年的时间努力让 Postgres 成为现实,并且会毫不犹豫地再次这样做。事实上,自从 Postgres 以来我已经做过很多次了。

This brings up the obvious question: “Why would anybody want to do something this difficult?” The answer is the same as with a Ph.D., getting tenure, or riding across America. I am inclined to accept such challenges. I spent a decade struggling to make Postgres real and would do it again in a heartbeat. In fact, I have done it multiple times since Postgres.

当今时代

The Present Day

我将跳到 2016 年,谈谈事情的最终结果,以此结束这个叙述。对于那些希望本文成为对当前好的(和不太好的)想法的评论的人,您可以观看我在 2015 年 IEEE 国际数据工程会议上关于此主题的演讲,网址为:http://kdb.snu.ac.kr/data/stonebraker_talk.mp4,或观看 ACM 数字图书馆中本文随附的视频。

I will finish this narrative by skipping to 2016 to talk about how things ultimately turned out. For those of you who were expecting this article to be a commentary on current good (and not-so-good) ideas, you can watch my IEEE International Conference on Data Engineering 2015 talk on this topic at http://kdb.snu.ac.kr/data/stonebraker_talk.mp4 or the video that accompanies this article in the ACM Digital Library.

现在的新罕布什尔州莫尔顿伯勒。Boston Bound 像离开时一样抵达加利福尼亚州,即绑在我们的车顶上。它现在放在我们位于新罕布什尔州的地下室里,积满了灰尘。自沃拉斯顿海滩那天起,就再没有人骑过它。

Moultonborough, NH, present day. Boston Bound arrived in California the same way it left, on the roof of our car. It now sits in our basement in New Hampshire gathering dust. It has not been ridden since that day at Wollaston Beach.

我仍然倾向于接受身体上的挑战。最近,我决定攀登新罕布什尔州所有 48 座海拔超过 4,000 英尺的山脉。在较柔和的维度中,我正在努力掌握五弦班卓琴。

I am still inclined to accept physical challenges. More recently, I decided to climb all 48 mountains in New Hampshire that are over 4,000 feet. In a softer dimension, I am struggling to master the five-string banjo.

Leslie 现在是纽约市一家天使投资人支持的初创公司的营销总监,该公司的软件碰巧在 Postgres 上运行。她拒绝主修计算机科学。

Leslie is now Director of Marketing for an angel-investor-backed startup in New York City, whose software incidentally runs on Postgres. She refused to major in computer science.

Illustra 已成功集成到 Informix 代码库中。IBM 于 2001 年收购了 Informix,该系统目前仍然可用。原始的 Illustra 代码线仍然保存在 IBM 档案中的某个地方。1995 年,当“Happy”和“Serious”用 SQL 接口取代 QUEL 查询语言时,学术版 Postgres 代码线得到了巨大的提升。随后,它被一个自发组成的专门团队接手,并由其引导发展至今。这是开源开发运作的光辉典范。有关此演变的简短历史,请参阅 Momjian [4]。该开源代码线也已集成到当前的多个 DBMS 中,包括 Greenplum 和 Netezza。大多数商业 DBMS 都使用 Postgres 风格的 ADT 扩展了其引擎。

Illustra was successfully integrated into the Informix code base. This system is still available from IBM, which acquired Informix in 2001. The original Illustra code line still exists somewhere in the IBM archives. The academic Postgres code line got a huge boost in 1995 when “Happy” and “Serious” replaced the QUEL query language with a SQL interface. It was subsequently adopted by a dedicated pickup team that has shepherded its development to this day. This is a shining example of open-source development in operation. For a short history of this evolution, see Momjian [4]. This open-source code line has also been integrated into several current DBMSs, including Greenplum and Netezza. Most commercial DBMSs have extended their engines with Postgres-style ADTs.

我现在想以三个最后的想法作为结束语。首先,我想提一下我构建的其他 DBMS——Ingres、C-Store/Vertica、H-Store/VoltDB 和 SciDB——它们都有与 Postgres 类似的开发故事。我本可以选择其中任何一个在本文中讨论。它们都有一群超级明星研究程序员,我就骑在他们的肩膀上。多年来,他们将我的想法变成了工作原型。其他编程巨星则将原型转换为可用于生产部署的坚如磐石的代码。经验丰富的初创企业高管小心翼翼地引导着这些脆弱的小公司。我特别感谢我现在的商业伙伴“Cueball”,感谢他在波涛汹涌的水域中的精心管理。此外,我要感谢陆地鲨鱼(Land Sharks),没有他们的资本,这一切都不可能实现,尤其是支持了我多家东海岸公司的“Believer”。

I now want to conclude with three final thoughts. First, I want to mention the other DBMSs I have built—Ingres, C-Store/Vertica, H-Store/VoltDB, and SciDB—all have development stories similar to that of Postgres. I could have picked any one of them to discuss in this article. All had a collection of superstar research programmers, on whose shoulders I have ridden. Over the years, they have turned my ideas into working prototypes. Other programming superstars have converted the prototypes into bulletproof working code for production deployment. Skilled startup executives have guided the small fragile companies with a careful hand. I am especially indebted to my current business partner, “Cueball,” for careful stewardship in choppy waters. Moreover, I want to acknowledge the Land Sharks, without whose capital none of this would have been possible, especially the “Believer,” who has backed several of my East Coast companies.

我特别感谢我的合作伙伴 Larry Rowe,以及以下 39 位编写 Postgres 的伯克利学生和工作人员:Jeff Anton、Paul Aoki、James Bell、Jennifer Caetta、Philip Chang、Jolly Chen、Ron Choi、Matt Dillon、Zelaine Fong、亚当·格拉斯、杰弗里·吴、史蒂文·格雷迪、塞尔吉·格拉尼克、马蒂·赫斯特、乔伊·海勒斯坦、迈克尔·广滨、洪金恒、洪伟、阿南特·金格伦、格雷格·凯姆尼茨、马塞尔·科纳克、凯斯·拉森、鲍里斯·利夫什茨、杰夫·梅雷迪思、金格尔·奥格尔、 Mike Olson、Nels Olsen、LayPeng Ong、Carol Paxson、Avi Pfeffer、Spyros Potamianos、Sunita Surawagi、David Muir Sharnoff、Mark Sullivan、Cimarron Taylor、Marc Teitelbaum、Yongdong Wang、Kristen Wright 和 Andrew Yu。

I am especially indebted to my partner, Larry Rowe, and the following 39 Berkeley students and staff who wrote Postgres: Jeff Anton, Paul Aoki, James Bell, Jennifer Caetta, Philip Chang, Jolly Chen, Ron Choi, Matt Dillon, Zelaine Fong, Adam Glass, Jeffrey Goh, Steven Grady, Serge Granik, Marti Hearst, Joey Hellerstein, Michael Hirohama, Chin-heng Hong, Wei Hong, Anant Jhingren, Greg Kemnitz, Marcel Kornacker, Case Larsen, Boris Livshitz, Jeff Meredith, Ginger Ogle, Mike Olson, Nels Olsen, LayPeng Ong, Carol Paxson, Avi Pfeffer, Spyros Potamianos, Sunita Surawagi, David Muir Sharnoff, Mark Sullivan, Cimarron Taylor, Marc Teitelbaum, Yongdong Wang, Kristen Wright, and Andrew Yu.

其次,我要感谢我的妻子贝丝。当我们穿越美国时,她不仅要花两个月的时间看着我的背影,还要处理我的目标导向、创办公司的愿望,以及常常无情地专注于“下一步”。我很难相处,而她却很忍耐。我不确定她是否意识到她对阻止我跌落个人悬崖负有主要责任。

Second, I want to acknowledge my wife, Beth. Not only did she have to spend two months looking at my back as we crossed America, she also gets to deal with my goal orientation, desire to start companies, and, often, ruthless focus on “the next step.” I am difficult to live with, and she is long-suffering. I am not sure she realizes she is largely responsible for keeping me from falling off my own personal cliffs.

第三,我要感谢我的朋友、同事和偶尔的参谋 Jim Gray,他是 1998 年 ACM A.M. 图灵奖获得者。他于九年前,即 2007 年 1 月 28 日在海上失踪。当我对 DBMS 社区说出以下的话时,我想我代表了整个社区:Jim,我们每天都想念你。

Third, I want to acknowledge my friend, colleague, and occasional sounding board, Jim Gray, recipient of the ACM A.M. Turing Award in 1998. He was lost at sea nine years ago on January 28, 2007. I think I speak for the entire DBMS community when I say: Jim: We miss you every day.

参考

References

[1] Date, C. 参照完整性。第七届国际超大型数据库会议论文集(法国戛纳,9 月 9 日至 11 日)。摩根考夫曼出版社,1981 年,2–12。

[1]  Date, C. Referential integrity. In Proceedings of the Seventh International Conference on Very Large Data Bases Conference (Cannes, France, Sept. 9–11). Morgan Kaufmann Publishers, 1981, 2–12.

[2] Madden, S. Mike Stonebraker 的 70 岁生日活动。麻省理工学院计算机科学与人工智能实验室,马萨诸塞州剑桥,2014 年 4 月 12 日;http://webcast.mit.edu/spr2014/csail/12apr14/

[2]  Madden, S. Mike Stonebraker’s 70th Birthday Event. MIT Computer Science and Artificial Intelligence Laboratory, Cambridge, MA, Apr. 12, 2014; http://webcast.mit.edu/spr2014/csail/12apr14/

[3] Mohan, C.、Haderle, D.、Lindsay, B.、Pirahesh, H. 和 Schwarz, P. Aries:一种使用预写日志记录支持细粒度锁定和部分回滚的事务恢复方法。ACM 数据库系统汇刊 17, 1(1992 年 3 月),94–162。

[3]  Mohan, C., Haderle, D., Lindsay, B., Pirahesh, H., and Schwarz, P. Aries: A transaction recovery method supporting fine granularity locking and partial rollbacks using write-ahead logging. ACM Transactions on Database Systems 17, 1 (Mar. 1992), 94–162.

[4] Momjian, B. PostgreSQL 开源发展史;https://momjian.us/main/writings/pgsql/history.pdf

[4]  Momjian, B. The History of PostgreSQL Open Source Development; https://momjian.us/main/writings/pgsql/history.pdf

[5] Stonebraker, M. Postgres 存储系统的设计。第 13 届国际超大型数据库会议论文集(英国布莱顿,9 月 1–4 日)。摩根考夫曼出版社,1987 年,289–300。

[5]  Stonebraker, M. The design of the Postgres storage system. In Proceedings of the 13th International Conference on Very Large Data Bases Conference (Brighton, England, Sept. 1–4). Morgan Kaufmann Publishers, 1987, 289–300.

[6] Stonebraker, M. 和 Rowe, L. Postgres 的设计。1986 年 SIGMOD 会议记录(华盛顿特区,5 月 28–30 日)。ACM 出版社,纽约,1986 年,340–355。

[6]  Stonebraker, M. and Rowe, L. The design of Postgres. In Proceedings of the 1986 SIGMOD Conference (Washington, D.C., May 28–30). ACM Press, New York, 1986, 340–355.

[7] Stonebraker, M. 和 Rowe, L. Postgres 数据模型。第 13 届国际超大型数据库会议论文集(英国布莱顿,9 月 1–4 日)。摩根考夫曼出版社,1987 年,83–96。

[7]  Stonebraker, M. and Rowe, L. The Postgres data model. In Proceedings of the 13th International Conference on Very Large Data Bases Conference (Brighton, England, Sept. 1–4). Morgan Kaufmann Publishers, 1987, 83–96.

[ 8 ] Stonebraker, M.、Rubenstein, B. 和 Guttman, A. 抽象数据类型和抽象索引在 CAD 数据库中的应用。ACM-IEEE 工程设计应用研讨会论文集(加利福尼亚州圣何塞,5 月)。ACM 出版社,纽约,1983 年,107–113。

[8]  Stonebraker, M., Rubenstein, B., and Guttman, A. Application of abstract data types and abstract indices to CAD databases. In Proceedings of the ACM-IEEE Workshop on Engineering Design Applications (San Jose, CA, May). ACM Press, New York, 1983, 107–113.

Michael Stonebraker ( stonebraker@csail.mit.edu ) 是马萨诸塞州剑桥市麻省理工学院计算机科学和人工智能实验室的兼职教授。

Michael Stonebraker (stonebraker@csail.mit.edu) is an adjunct professor in the MIT Computer Science and Artificial Intelligence Laboratory, Cambridge, MA.

© 2016 ACM 0001-0782/16/02 15.00 美元

© 2016 ACM 0001-0782/16/02 $15.00

最初发表于 Communications of the ACM, 59(2): 74–83, 2016。原始 DOI: 10.1145/2869958

Originally published in Communications of the ACM, 59(2): 74–83, 2016. Original DOI: 10.1145/2869958

第二部分

PART II

迈克·斯通布雷克的职业生涯

MIKE STONEBRAKER’S CAREER

1

1

让它发生:迈克尔·斯通布雷克的一生

Make it Happen: The Life of Michael Stonebraker

塞缪尔·马登

Samuel Madden

让它发生

Make it happen.

——迈克尔·斯通布雷克

—Michael Stonebraker

概要

Synopsis

Michael Stonebraker 是美国计算机科学家、教师、发明家和技术企业家,也是近 40 年来数据库领域的思想领袖。他与加州大学伯克利分校的 Eugene Wong 教授一起开发了第一个关系数据库管理系统(RDBMS)原型(Ingres),该系统与 IBM 的 System R 和 Oracle 的 Oracle 数据库一起证明了 RDBMS 市场的可行性。他对 RDBMS 系统的设计做出了许多持久的贡献,最引人注目的是作为后 Ingres 的 Postgres 项目的一部分,开发了对象关系模型,该模型成为使用抽象数据类型(ADT)扩展数据库系统的事实标准方式。

Michael Stonebraker is an American computer scientist, teacher, inventor, technology entrepreneur, and intellectual leader of the database field for nearly the last 40 years. With fellow professor Eugene Wong of the University of California at Berkeley, he developed the first relational database management system (RDBMS) prototype (Ingres), which, together with IBM’s System R and Oracle’s Oracle database, proved the viability of the RDBMS market. He made many lasting contributions to the design of RDBMS systems, most notably developing, as a part of his post-Ingres Postgres project, the Object-Relational model that became the de facto way of extending database systems with abstract data types (ADTs).

作为研究和行业的主要贡献者,Stonebraker 采用了一种截然不同的数据库研究方法:他强调针对现实生活中的问题,而不是更抽象的理论研究。他的研究以开放、可运行的学术原型为重点,并不断努力通过商业化来验证他的想法:基于他的研究创立或共同创立了九家公司。截至本书出版,Stonebraker 已发表了超过 300 篇研究论文1(参见“迈克尔·斯通布雷克文集”,第 607 页),影响了数百名学生,并指导了 31 名博士生(参见“Michael Stonebraker 的学生谱系”,第 52 页),其中许多人后来取得了成功的学术生涯,或自己创办了成功的初创公司。

A leading contributor to both research and industry, Stonebraker took a distinctively different approach to database research: He emphasized targeting real-life problems over more abstract, theoretical research. His research is notable for its focus on open, working academic prototypes and on his repeated efforts to prove out his ideas through commercialization: founding or co-founding nine companies based on his research. As of this book, Stonebraker has produced more than 300 research papers1 (see “The Collected Works of Michael Stonebraker,” p. 607), influenced hundreds of students, and advised 31 Ph.D. students (see “Michael Stonebraker’s Student Genealogy,” p. 52), many of whom went on to successful academic careers or founded successful startup companies of their own.

Stonebraker 比任何其他人都更能将 Edgar (Ted) Codd 的数据独立性和关系数据库模型的愿景 [Codd 1970] 变成现实,从而创造了当今价值超过 550 亿美元的市场。Stonebraker 提出的想法几乎出现在市场上的每一个关系数据库产品中,因为他和其他人采用了他的开源代码,对其进行了改进、构建和扩展。因此,Stonebraker 对关系数据库市场产生了乘数效应。由于其开创性的想法,Stonebraker 获得了 2014 年 ACM AM 图灵奖,表彰他“对现代数据库系统底层概念和实践的基本贡献”,其中许多是作为 Ingres 和 Postgres 项目的一部分开发的。

More than any other person, Stonebraker made Edgar (Ted) Codd’s vision [Codd 1970] of data independence and the relational database model a reality, leading to the $55 billion-plus market that exists today. Stonebraker-originated ideas appear in virtually every relational database product on the market, as he and others have taken his open-source code, refined it, built on it, and extended it. As a result of this, Stonebraker has had a multiplier effect on the relational database market. For his pioneering ideas, Stonebraker received the 2014 ACM A.M. Turing Award citing his “fundamental contributions to the concepts and practices underlying modern database systems,” many of which were developed as a part of the Ingres and Postgres projects.

早年经历与教育2

Early Years and Education2

迈克尔·“迈克”·斯通布雷克 (Michael “Mike” Stonebraker) 1943 年 10 月 11 日出生于马萨诸塞州纽伯里波特,是三个儿子中的中间人,父亲是工程师,母亲是教师。他在新罕布什尔州米尔顿米尔斯长大,靠近缅因州边境。他的父母非常重视教育。当斯通布雷克十岁的时候,他的父亲举家搬到了马萨诸塞州的纽伯里,那里是著名的州长学院(前身为州长杜默学院)的所在地。由于纽伯里当时没有当地高中,该镇将为任何符合学业资格的当地居民支付走读学生学费,导致斯通布雷克的三个男孩全部就读。

Michael “Mike” Stonebraker was born on October 11, 1943, in Newburyport, Massachusetts, the middle of three sons born to an engineer and a schoolteacher. He grew up in Milton Mills, New Hampshire, near the Maine border. His parents placed a high emphasis on education; when Stonebraker was ten, his father moved the family to Newbury, Massachusetts, home to the prestigious Governor’s Academy (formerly Governor Dummer Academy). As Newbury had no local high school at the time, the town would pay day-student tuition for any local residents who could qualify academically, resulting in all three Stonebraker boys attending the school.

斯通布雷克在高中时在数学和科学方面表现出色,1961 年毕业后就读于普林斯顿大学。1965 年,他从普林斯顿大学毕业,获得电气工程学士学位。当时普林斯顿大学没有计算机课程,也没有计算机科学专业。

Stonebraker excelled at mathematics and the sciences in high school, and upon graduating in 1961, enrolled in Princeton University. He graduated with a bachelor’s degree in Electrical Engineering from Princeton in 1965. There were no computer classes at Princeton nor a computer science major at the time.

1965 年越南战争期间,作为一名刚大学毕业的年轻人,斯通布雷克在《迈克尔·斯通布雷克的口述历史》[Grad 2007] 中回忆道,他有四个人生选择:“去越南,去加拿大,去监狱,或者去读研究生。”这个决定是显而易见的:在美国国家科学基金会 (NSF) 奖学金的支持下,他进入了密歇根大学安娜堡分校研究生院,加入了计算机信息与控制工程 (CICE) 项目,这是一个专注于计算机科学的工程联合项目。1967 年,他获得了 CICE 的硕士学位。

As a young man graduating college in 1965 during the Vietnam War, Stonebraker recalls in the “Oral History of Michael Stonebraker” [Grad 2007] that he had four life choices: “go to Vietnam, go to Canada, go to jail, or go to graduate school.” The decision was obvious: backed by a National Science Foundation (NSF) fellowship, he enrolled in graduate school at the University of Michigan at Ann Arbor, joining the Computer Information and Control Engineering (CICE) program, a joint program in engineering focused on computer science. He received a M.Sc. in CICE in 1967.

表1.1 Michael Stonebraker的学术地位

Table 1.1 The academic positions of Michael Stonebraker

计算机科学助理教授

Assistant Professor of Computer Science

加州大学伯克利分校

University of California at Berkeley

1971–1976

1971–1976

副教授

Associate Professor

加州大学伯克利分校

University of California at Berkeley

1976年–1982年

1976–1982

教授

Professor

加州大学伯克利分校

University of California at Berkeley

1982年–1994年

1982–1994

研究生院教授

Professor of the Graduate School

加州大学伯克利分校

University of California at Berkeley

1994–1999

1994–1999

高级讲师

Senior Lecturer

麻省理工学院

Massachusetts Institute of Technology

2000年–2001年

2000–2001

客座教授

Adjunct Professor

麻省理工学院

Massachusetts Institute of Technology

2002 年至今

2002–Present

1967 年,他的处境仍然没有更好的选择,于是决定留在安娜堡攻读博士学位,并于 1971 年凭借博士论文《随机链的大规模马尔可夫模型的简化》[Stonebraker 1971c] 获得学位;他将其描述为应用有限的理论研究。(事实上,人们想知道,他在后来的工作中对适用性的关注,是否正是对其早期研究缺乏适用性的一种反应。)

With his options no better in 1967, he decided to stay on at Ann Arbor to get his Ph.D., which he received in 1971 for his doctoral dissertation “The Reduction of Large Scale Markov Models for Random Chains” [Stonebraker 1971c]—which he describes as theoretical research with limited applications. (Indeed, one wonders if his focus on applicability in his later work wasn’t a reaction to the lack of applicability of his early research.)

学术生涯与 Ingres 的诞生

Academic Career and the Birth of Ingres

1971 年,Stonebraker 被聘为加州大学伯克利分校助理教授,在电气工程和计算机科学 (EECS) 系研究技术在公共系统中的应用。在接下来的 28 年里,他在加州大学伯克利分校担任一位有影响力且高产的教授(参见“迈克尔·斯通布雷克的学生谱系”,第 52 页),并于 1999 年以研究生院教授的身份退休,然后前往东部加入麻省理工学院(见表1.1)。

In 1971, Stonebraker was hired as an assistant professor at the University of California at Berkeley to work on the application of technology to public systems in the Electrical Engineering and Computer Science (EECS) Department. He would go on to spend the next 28 years as an influential and highly productive professor at UC Berkeley (see “Michael Stonebraker’s Student Genealogy,” p. 52), retiring as professor of the graduate school in 1999 before moving east to join MIT (see Table 1.1).

作为一名新任助理教授,他很快发现,为他的公共系统工作获取数据非常困难,而且他对城市动力学(建模和应用数据来预测城市地区的增长)的兴趣并不能帮助他“成名”。[Grad 2007]

As a new assistant professor, he soon discovered that getting data for his public-systems work was very hard, and that his interest in Urban Dynamics—modeling and applying data to predicting growth in urban areas—wasn’t going to help him “get famous.” [Grad 2007]

当斯通布雷克四处寻找更多能带来名气的材料时,尤金·黄(Eugene Wong)教授建议他阅读特德·科德的开创性论文。Stonebraker 还阅读了 CODASYL(数据系统语言会议)报告 [Metaxides et al. 1971],但认为后者过于复杂而予以驳回。他有一个更好的主意。在《迈克尔·斯通布雷克的口述历史》[Grad 2007] 中,他回忆道:

While Stonebraker was casting around for more fame-carrying material, Professor Eugene Wong suggested that he read Ted Codd’s seminal paper. Stonebraker also read the CODASYL (Conference on Data Systems Languages) report [Metaxides et al. 1971] but dismissed the latter as far too complicated. He had a better idea. In the “Oral History of Michael Stonebraker” [Grad 2007], he recalled:

……我不明白为什么你会想做那么复杂的事情,而 Ted 的工作很简单,容易理解。因此,很明显,反对者当时已经在说,没有博士学位的人无法理解 Ted Codd 的谓词演算或他的关系代数。而且即使你越过了这个障碍,也没有人能够高效地实现这些东西。

… I couldn’t figure out why you would want to do anything that complicated and Ted’s work was simple, easy to understand. So it was pretty obvious that the naysayers were already saying nobody who didn’t have a Ph.D. could understand Ted Codd’s predicate calculus or his relational algebra. And even if you got past that hurdle, nobody could implement the stuff efficiently.

而且即使你越过了那个障碍,你也永远无法把这些东西教给 COBOL 程序员。因此,很明显,正确的做法是构建一个具有易用查询语言的关系数据库系统。于是,吉恩 [Wong] 和我在 1972 年着手这样做。而且你不必是火箭科学家也能意识到,这是一个有趣的研究项目。

And even if you got past that hurdle, you could never teach this stuff to COBOL programmers. So it was pretty obvious that the right thing to do was to build a relational database system with an accessible query language. So Gene [Wong] and I set out to do that in 1972. And you didn’t have to be a rocket scientist to realize that this was an interesting research project.

这个项目最终成为 Ingres 系统(参见第 5 章和第 15章)。Stonebraker 在 Ingres 项目中面临的挑战是令人畏惧的:它们无非是开发自动编程技术,将声明性查询规范转换为可执行算法,这些算法可以像熟练程序员在当今领先的商业系统上编写的代码一样高效地进行评估——所有这一切都是基于一个新的、未经验证的数据模型。值得注意的是,Stonebraker 当时是加州大学伯克利分校的助理教授,在完成博士学位两年后就开始了该项目,并与 IBM 的 System R 团队一起启动了该项目(请参阅第 35 章),开发使关系数据库成为现实的想法和方法。Ingres 的许多思想和方法至今仍在每个关系数据库系统中使用,包括使用视图和查询重写来实现数据完整性和访问控制、将持久哈希表作为主要访问方法、主副本复制控制以及实现数据库系统中的规则/触发器。此外,Ingres 项目内的实验评估为构建可提供令人满意的交易性能的锁定系统所涉及的问题提供了重要的见解。

This project eventually became the Ingres system (see Chapters 5 and 15). The challenges faced by Stonebraker in the Ingres project were daunting: they amounted to nothing less than developing automatic programming techniques to convert declarative query specifications into executable algorithms that could be evaluated as efficiently as code written by skilled programmers on the leading commercial systems of the day—all of this over a new, unproven data model. Remarkably, Stonebraker at the time was an assistant professor at UC Berkeley, starting the project just two years after completing his Ph.D., and, along with the System R team at IBM (see Chapter 35), developing the ideas and approaches that made relational databases a reality. Many of the Ingres ideas and approaches are still used by every relational database system today, including the use of views and query rewriting for data integrity and access control, persistent hash tables as a primary access method, primary-copy replication control, and the implementation of rules/triggers in database systems. Additionally, experimental evaluation within the Ingres project provided critical insights into issues involved in building a locking system that could provide satisfactory transaction performance.

由于 Ingres 和 System R 团队在这些系统的开发过程中密切沟通,因此有时很难区分个人的贡献;1988 年,他们都因其开创性工作而获得了 ACM 软件系统奖。Stonebraker 在 Ingres 中发明的中心思想之一是使用查询修改来实现视图。视图是数据库中的虚拟表,它实际上并不存在,而是定义为数据库查询。视图的概念几乎同时出现在 Ingres 和 System R 项目的论文中,但 Stonebraker 开发了算法,使其实现变得实用,并且至今仍然是数据库支持它们的方式。他还表明这些技术可用于维护数据库访问控制和完整性。

Because the Ingres and System R teams communicated closely during the development of these systems, it is sometimes difficult to tease apart the individual contributions; they both received the ACM Software System Award in 1988 for their pioneering work. One of the central ideas invented by Stonebraker in Ingres was the use of query modification to implement views. A view is a virtual table in a database that is not physically present but is instead defined as a database query. The idea of views appears nearly concurrently in papers from Ingres and the System R project, but Stonebraker developed the algorithms that made their implementation practical and that are still the way databases support them today. He also showed that these techniques could be used for preserving database access control and integrity.

Stonebraker 的想法首次出现在他 1974 年的论文“通过查询修改在关系数据库管理系统中进行访问控制”[Stonebraker and Wong 1974b],后来在他 1976 年的论文“通过查询修改实现完整性约束和视图”[Stonebraker 1975] 中得到了发展。他的主要想法是使用他称之为“交互修改”的重写技术来实现视图——本质上将对视图的查询重写为对数据库中物理表的查询。这是一个极其强大和优雅的实现思想,除了视图之外,它还被用于许多其他功能,包括授权、完整性执行和数据保护。

Stonebraker’s ideas first appear in his 1974 paper “Access Control in a Relational Database Management System By Query Modification” [Stonebraker and Wong 1974b] and are later developed in his 1976 paper “Implementation of Integrity Constraints and Views By Query Modification” [Stonebraker 1975]. His key idea is to implement views using the rewriting technique he called “interaction modification”—essentially rewriting queries over views into queries over the physical tables in the database. This is an extremely powerful and elegant implementation idea that has been adopted for many other features besides views, including authorization, integrity enforcement, and data protection.
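作为说明(并非 Ingres 的实际代码),下面这个极简的 Python 草图演示了查询修改的基本思路:引用视图的查询被改写为只引用基础表的查询。其中的表名、列名和视图定义都是假设的示例。

As an illustration (not actual Ingres code), the minimal Python sketch below shows the basic idea of query modification: a query that references a view is rewritten into one that references only base tables. The table, column, and view names are hypothetical examples.

```python
# A hypothetical catalog mapping each view name to its defining query
# over base tables. Views never exist physically.
views = {
    "shoe_emp": "(SELECT name, salary FROM employee WHERE dept = 'Shoe')",
}

def rewrite(query: str) -> str:
    """Naively substitute each view name with its defining query."""
    for name, definition in views.items():
        query = query.replace(name, f"{definition} AS {name}")
    return query

q = "SELECT name FROM shoe_emp WHERE salary > 50000"
print(rewrite(q))
# The rewritten query mentions only the base table 'employee', so the
# executor never needs the view to be materialized.
```

A real implementation operates on the parsed query tree rather than on strings; the same folding of predicates into the query is what makes the technique reusable for access control and integrity enforcement.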

后 Ingres 时代

The Post-Ingres Years

在 Ingres 之后,Stonebraker 和他的学生在 20 世纪 80 年代开始致力于后续的 Postgres 项目(参见第 16 章)。与 Ingres 一样,Postgres 也具有巨大的影响力。它是第一个支持对象关系数据模型的系统,它允许将抽象数据类型合并到关系数据库模型中。这使得程序员能够“将代码移动到数据”,将复杂的抽象数据类型和操作直接嵌入到数据库中。

After Ingres, Stonebraker and his students began working on the follow-on Postgres project in the 1980s (see Chapter 16). Like Ingres, Postgres was hugely influential. It was the first system to support the Object-Relational data model, which allows the incorporation of abstract data types into the relational database model. This enabled programmers to “move code to data,” embedding sophisticated abstract data types and operations on them directly inside of the database.

Stonebraker 在他 1986 年的论文“在 Postgres 中使用过程进行对象管理”[Stonebraker 1986c] 中描述了这个模型,包括对语言的扩展和对数据库实现的必要修改,其中融入了 20 世纪 80 年代初在 Ingres 中通过早期“用户定义函数”实验获得的见解。与当时提出将持久对象集成到面向对象编程语言中的流行思想相反,Stonebraker 的思想使关系模型得以蓬勃发展,同时获得各种丰富的新数据类型的好处。同样,这个想法被用于每个现代数据库系统中。近年来,随着数据库供应商将其系统发展成为支持复杂统计、机器学习和推理算法的“分析平台”,在“大数据”的热潮中,这一思想也变得越来越重要。这些功能正是使用 Stonebraker 首创的对象关系接口提供的。与 Ingres 一样,Postgres 系统探索了许多其他远远超前于时代的激进想法,包括数据库中不可变数据和历史“时间旅行”查询的概念,以及使用持久内存来实现轻量级事务方案。

Stonebraker described this model, including extensions to the language and necessary modifications to the database implementation, in his 1986 paper “Object Management in Postgres Using Procedures” [Stonebraker 1986c], which included insights learned through earlier experiments with “user-defined functions” in Ingres in the early 1980s. In contrast to the prevailing ideas of the time that proposed integrating persistent objects into object-oriented programming languages, Stonebraker’s idea allowed the relational model to thrive while obtaining the benefits of a variety of rich new data types. Again, this idea is used in every modern database system. It has also become increasingly important in recent years, with all the buzz about “Big Data” as database vendors have grown their systems into “analytic platforms” that support complex statistics, machine learning, and inference algorithms. These features are provided using the Object-Relational interfaces pioneered by Stonebraker. Like Ingres, the Postgres system explored a number of other radical ideas that were well before their time, including notions of immutable data and historical “time travel” queries in databases, and the use of persistent memory to implement lightweight transaction schemes.

在 Ingres 和 Postgres 中,Stonebraker 的想法的影响被极大地放大了,因为它们体现在被广泛使用的强大实现中。这些系统经过精心设计,构成了许多现代数据库系统的基础。例如,Ingres 用于构建 Sybase SQL Server(后来成为 Microsoft SQL Server),而 Postgres 已被用作过去 20 年来开发的许多新商业数据库的基础,包括 Illustra、Informix、Netezza 和 Greenplum。要完整了解 Ingres 对其他 DBMS 的影响,请参阅第 13 章中的 RDBMS 谱系。

In both Ingres and Postgres, the impact of Stonebraker’s ideas was amplified tremendously by the fact that they were embodied in robust implementations that were widely used. These systems were so well engineered that they form the basis of many modern database systems. For example, Ingres was used to build Sybase SQL Server, which then became Microsoft SQL Server, and Postgres has been used as the basis for many of the new commercial databases developed over the last 20 years, including those from Illustra, Informix, Netezza, and Greenplum. For a complete view of Ingres’ impact on other DBMSs, see the RDBMS genealogy in Chapter 13.

工业、麻省理工学院和新千年

Industry, MIT, and the New Millennium

在 Postgres 项目之后,Stonebraker 开始大量涉足工业界,最著名的是在 Illustra Information Technologies, Inc.,将 Postgres 项目的想法转变为主要的商业数据库产品。1996 年,Illustra 被 Informix, Inc. 收购,并任命 Stonebraker 担任 CTO,直到他于 2000 年离职。

After the Postgres project, Stonebraker became heavily involved in industry, most notably at Illustra Information Technologies, Inc., turning ideas from the Postgres project into a major commercial database product. In 1996, Illustra was acquired by Informix, Inc., which brought on Stonebraker as CTO until his departure in 2000.

1999 年,斯通布雷克搬到新罕布什尔州,并于 2001 年开始在麻省理工学院担任兼职教授。尽管已经取得了一系列令人印象深刻的学术和商业成功,Stonebraker 在本世纪初启动了一系列引人注目的研究项目和商业公司,首先是 Aurora 和 StreamBase 项目(见第 17 章),这些项目是他与布兰代斯大学、布朗大学和麻省理工学院的同事共同创立的。这些项目探索了管理数据流的想法,使用新的数据模型和查询语言,重点是从外部数据源(例如传感器和互联网数据源)持续到达的数据项序列。Stonebraker 于 2003 年与他人共同创立了 StreamBase Systems,将 Aurora 和 Borealis 开发的技术商业化。StreamBase 于 2013 年被 TIBCO Software Inc. 收购。

In 1999, Stonebraker moved to New Hampshire, and in 2001 started as an adjunct professor at MIT. Despite already having compiled an impressive array of academic and commercial successes, Stonebraker launched a remarkable string of research projects and commercial companies at the start of the millennium, beginning with the Aurora and StreamBase projects (see Chapter 17), which he founded with colleagues from Brandeis University, Brown University, and MIT. The projects explored the idea of managing streams of data, using a new data model and query language where the focus was on continuously arriving sequences of data items from external data sources such as sensors and Internet data feeds. Stonebraker co-founded StreamBase Systems in 2003 to commercialize the technology developed in Aurora and Borealis. StreamBase was acquired by TIBCO Software Inc. in 2013.

2005 年,Stonebraker 再次与 Brandeis、Brown 和 MIT 的同事一起启动了 C-Store 项目,其目标是开发一种专注于所谓数据分析的新型数据库系统:在大型、不经常更新的数据库上运行长时间、扫描密集型的查询;这与事务性工作负载相反,后者侧重于对单个数据库记录的许多小型并发读取和写入。C-Store 是一个专为这些工作负载而设计的无共享、面向列的数据库(请参阅第 18 章)。通过将数据按列存储,并仅访问回答特定查询所需的列,C-Store 比按行存储数据的传统系统具有更高的输入/输出效率(I/O 效率),与当时领先的商业系统相比速度提升了一个数量级。同年,Stonebraker 与他人共同创立了 Vertica Systems, Inc.,将 C-Store 背后的技术商业化;Vertica Systems 于 2011 年被 Hewlett-Packard, Inc. 收购。尽管 Stonebraker 并不是第一个提出面向列数据库想法的人,但 Vertica 的成功导致了采用面向列设计的商业系统的激增,包括 Microsoft 并行数据仓库项目(现已作为列存储索引纳入 Microsoft SQL Server)和 Oracle 内存列存储(In-Memory Column Store)。

In 2005, again with colleagues from Brandeis, Brown, and MIT, Stonebraker launched the C-Store project, where the aim was to develop a new type of database system focused on so-called data analytics, where long-running, scan-intensive queries are run on large, infrequently updated databases, as opposed to transactional workloads, which focus on many small concurrent reads and writes to individual database records. C-Store was a shared-nothing, column-oriented database designed for these workloads (see Chapter 18). By storing data in columns, and accessing only the columns needed to answer a particular query, C-Store was much more input/output-efficient (I/O-efficient) than conventional systems that stored data in rows, offering order-of-magnitude speedups vs. leading commercial systems at the time. That same year, Stonebraker co-founded Vertica Systems, Inc., to commercialize the technology behind C-Store; Vertica Systems was acquired in 2011 by Hewlett-Packard, Inc. Although Stonebraker wasn’t the first to propose column-oriented database ideas, the success of Vertica led to a proliferation of commercial systems that employed column-oriented designs, including the Microsoft Parallel Data Warehouse project (now subsumed into Microsoft SQL Server as Column Store indexes) and the Oracle In-Memory Column Store.
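下面是一个极简的 Python 草图(假设性示例,并非 C-Store 代码),用来说明列式存储为何能节省 I/O:只涉及一列的分析查询只需读取该列,而行式存储必须扫描每行的所有字段。

The following minimal Python sketch (a hypothetical example, not C-Store code) illustrates why column orientation saves I/O: an analytic query that touches a single column reads only that column, while a row store must scan every field of every row.

```python
# Row store: one tuple per record, all attributes stored together.
rows = [(i, f"name{i}", i % 100, i * 1.5) for i in range(1000)]

# Column store: one array per attribute.
ids, names, ages, scores = (list(col) for col in zip(*rows))

# Averaging one attribute over the row store touches every field...
fields_row_store = sum(len(r) for r in rows)   # 4 fields x 1000 rows = 4000
# ...but over the column store it touches only the one column.
fields_col_store = len(ages)                   # 1000
avg_age = sum(ages) / len(ages)
print(fields_row_store, fields_col_store)      # prints: 4000 1000
```

On disk the gap is the same: a scan reads pages of only the referenced columns instead of pages of whole rows, which is where the order-of-magnitude speedups come from.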

离开 C-Store 后,Stonebraker 继续他的一系列学术和工业活动,包括以下内容。

After C-Store, Stonebraker continued his string of academic and industrial endeavors, including the following.

• 2006 年,Morpheus 项目(后来成为 Goby, Inc.)专注于数据集成系统,将不同的基于 Web 的数据源转换为具有一致模式的统一视图。

•  In 2006, the Morpheus project, which became Goby, Inc., focused on a data integration system to transform different web-based data sources into a unified view with a consistent schema.

• 2007 年,H-Store 项目后来成为 VoltDB, Inc.,专注于在线事务处理(参见第 19 章)。

•  In 2007, the H-Store project, which became VoltDB, Inc., focused on online transaction processing (see Chapter 19).

• 2008 年,SciDB 项目(后来成为 Paradigm4, Inc.)专注于面向数组的数据存储和科学应用(请参阅第 20 章)。

•  In 2008, the SciDB project, which became Paradigm4, Inc., focused on array-oriented data storage and scientific applications (see Chapter 20).

• 2011 年,Data Tamer 项目(后来成为 Tamr Inc.)专注于大规模数据集成和统一(请参阅第 21 章)。

•  In 2011, the Data Tamer project, which became Tamr Inc., focused on massive-scale data integration and unification (see Chapter 21).

Stonebraker 的遗产

Stonebraker’s Legacy

Stonebraker 的系统如此有影响力有两个原因。

Stonebraker’s systems have been so influential for two reasons.

首先,Stonebraker 对系统的工程设计使其在性能和可用性上足以在 40 年后仍然存在。实现这一目标所需的工程技术在两篇极具影响力的系统论文中得到了简洁的描述:1976 年的论文“Ingres 的设计和实现”[Stonebraker 等人 1976b] 和 1986 年的论文“Postgres 的实现”[Stonebraker 等人 1990b]。这些论文教会了一代又一代的学生和从业者如何构建数据库系统。

First, Stonebraker engineered his systems to deliver performance and usability that allowed them to still live on 40 years later. The engineering techniques necessary to achieve this are succinctly described in two extremely influential systems papers, the 1976 paper “The Design and Implementation of Ingres” [Stonebraker et al. 1976b] and the 1986 paper “The Implementation of Postgres” [Stonebraker et al. 1990b]. These papers have taught many generations of students and practitioners how to build database systems.

其次,Stonebraker 的系统构建在通用 Unix 平台上并以开源形式发布。在“低端”Unix 机器上构建这些系统需要仔细考虑数据库系统将如何使用操作系统。这与许多以前在“裸机”上构建为垂直堆栈的数据库系统形成鲜明对比。Stonebraker 1986 年的论文“数据管理的操作系统支持”[Stonebraker 和 Kumar 1986] 中描述了在 Unix 上构建数据库系统的挑战以及许多建议的解决方案。Ingres 和 Postgres 原型以及这篇开创性的论文推动操作系统实现者和数据库设计者达成相互理解,而这种理解已成为现代操作系统支持长期运行、I/O 密集型服务的关键——这些服务支撑着许多现代可扩展系统。此外,通过以开源形式发布他的系统,Stonebraker 促进了学术界和工业界的创新,因为许多新的数据库和研究论文都是基于这些成果的。有关 Stonebraker 开源系统影响的更多信息,请参阅第 12 章。

Second, Stonebraker systems were built on commodity Unix platforms and released as open source. Building these systems on “low end” Unix machines required careful thinking about how database systems would use the operating system. This is in contrast to many previous database systems that were built as vertical stacks on “bare metal.” The challenges of building a database system on Unix, and a number of proposed solutions, were described in Stonebraker’s 1986 paper “Operating System Support for Data Management” [Stonebraker and Kumar 1986]. The Ingres and Postgres prototypes and this seminal paper pushed both OS implementers and database designers toward a mutual understanding that has become key to modern OS support of the long-running, I/O-intensive services that underpin many modern scalable systems. Furthermore, by releasing his systems as open source, Stonebraker enabled innovation in both academia and industry, as many new databases and research papers were based on these artifacts. For more on the impact of Stonebraker’s open-source systems, see Chapter 12.

总之,Stonebraker 奠定了现代数据库系统的大部分软件基础。Charlie Bachman、Ted Codd 和 Jim Gray 对数据管理做出了巨大贡献,并都因此获得了图灵奖。与他们一样,Stonebraker 开发了每个现代关系数据库系统中使用的许多基本思想。然而,比任何其他人都更进一步,是 Stonebraker 证明了将关系模型从理论付诸实践是可能的。Stonebraker 的软件成果作为开源和商业产品延续至今,其中保存着世界上许多重要数据。他的想法继续影响着许多新数据处理系统的设计和功能——不仅是关系数据库,还有最近受到关注的“大数据”系统。值得注意的是,Stonebraker 至今仍然极为高产,引领着数据库社区的研究议程(参见第 3 章),贡献着创新研究(参见第 22 章和第 23 章),并通过他在流处理、列存储、科学数据库和事务处理方面的工作产生了巨大的知识和商业影响。这些都是 Stonebraker 荣获 ACM A.M. 图灵奖的原因。

In summary, Stonebraker is responsible for much of the software foundation of modern database systems. Charlie Bachman, Ted Codd, and Jim Gray made monumental contributions to data management, earning each of them the Turing Award. Like them, Stonebraker developed many of the fundamental ideas used in every modern relational database system. However, more than any other individual, it was Stonebraker who showed that it was possible to take the relational model from theory to practice. Stonebraker’s software artifacts continue to live on as open-source and commercial products that contain much of the world’s important data. His ideas continue to impact the design and features of many new data processing systems—not only relational databases but also the “Big Data” systems that have recently gained prominence. Remarkably, Stonebraker continues to be extremely productive and drive the research agenda for the database community (see Chapter 3), contributing innovative research (see Chapters 22 and 23), and having massive intellectual and commercial impact through his work on stream processing, column stores, scientific databases, and transaction processing. These are the reasons Stonebraker won the ACM A.M. Turing Award.

公司

Companies

截至撰写本文时,斯通布雷克已经创立/共同创立了九家初创公司,以将他的学术系统商业化。

As of this writing, Stonebraker had founded/co-founded nine startup companies to commercialize his academic systems.

• Relational Technology, Inc.,成立于 1980 年;更名为 Ingres Corporation(1989 年);被 Ask 收购(1990 年),后被 Computer Associates 收购(1994 年);分拆为一家私营公司 Ingres Corp.(2005 年);收购 VectorWise(2010 年);更名为 Actian(2011 年);收购 Versant Corporation(2012 年);收购 Pervasive Software(2013 年);收购 ParAccel(2013 年)。2016 年,Actian 逐步淘汰了 ParAccel、VectorWise 和 DataFlow,保留了 Ingres。

•  Relational Technology, Inc., founded 1980; became Ingres Corporation (1989); acquired by Ask (1990), acquired by Computer Associates (1994); spun out as a private company Ingres Corp. (2005); acquired VectorWise (2010); name change to Actian (2011); acquired Versant Corporation (2012); acquired Pervasive Software (2013); acquired ParAccel (2013). In 2016, Actian phased out ParAccel, VectorWise, and DataFlow, retaining Ingres.

• Illustra 信息技术公司,成立于 1992 年;1996 年被 Informix 收购,Stonebraker 在 1996 年至 2000 年间担任 Informix 首席技术官,后来被 IBM 收购。

•  Illustra Information Technologies, founded 1992; acquired in 1996 by Informix, where Stonebraker was Chief Technology Officer 1996–2000, and was later acquired by IBM.

• Cohera Corporation,成立于 1997 年(被 PeopleSoft 收购)。

•  Cohera Corporation, founded 1997 (acquired by PeopleSoft).

• StreamBase Systems,成立于2003 年(2013 年被TIBCO 收购)。

•  StreamBase Systems, founded 2003 (acquired by TIBCO in 2013).

• Vertica Systems,成立于 2005 年(2011 年被 HP 收购)。

•  Vertica Systems, founded 2005 (acquired by HP in 2011).

• Goby,成立于 2008 年(2011 年被 Telenav 收购)。

•  Goby, founded in 2008 (acquired by Telenav in 2011).

• VoltDB,成立于2009 年。

•  VoltDB, founded in 2009.

• Paradigm4,成立于2010 年。

•  Paradigm4, founded in 2010.

• Tamr,成立于2013年。

•  Tamr, founded in 2013.

奖项与荣誉

Awards and Honors

有关 Michael Stonebraker 的奖项和荣誉的更多信息,请参阅表 1.2 。

See Table 1.2 for further information on Michael Stonebraker’s awards and honors.

服务

Service

有关 Michael Stonebraker 服务的更多详细信息,请参阅表 1.3 。

See Table 1.3 for further details on Michael Stonebraker’s service.

表1.2 Michael Stonebraker所获奖项及荣誉

Table 1.2 Awards and honors of Michael Stonebraker

ACM软件系统奖

ACM Software System Award

1988年

1988

ACM SIGMOD 创新奖

ACM SIGMOD Innovation Award

1994年

1994

ACM院士

ACM Fellow

1993年

1993

ACM SIGMOD“Test of Time”奖(最佳论文,10 年后)

ACM SIGMOD “Test of Time” Award (best paper, 10 years later)

1997年、2017年

1997, 2017

美国国家工程院

National Academy of Engineering

1998年当选

Elected 1998

IEEE 约翰·冯·诺依曼奖章

IEEE John von Neumann Medal

2005年

2005

美国艺术与科学学院

American Academy of Arts and Sciences

2011年

2011

艾伦·图灵奖

Alan M. Turing Award

2014年

2014

表 1.3 Michael Stonebraker 的服务

Table 1.3 Service of Michael Stonebraker

ACM SIGMOD 主席

Chairman of ACM SIGMOD

1981–1984

1981–1984

SIGMOD 1987 大会主席

General Chairperson, SIGMOD 1987

1987年

1987

SIGMOD 1992 程序委员会主席

Program Chairperson SIGMOD 1992

1992年

1992

拉古纳海滩研讨会组织者

Organizer, the Laguna Beach Workshop

1988年

1988

Asilomar DBMS 研讨会组织者

Organizer, the Asilomar DBMS Workshop

1996年

1996

SIGMOD 奖项委员会

SIGMOD Awards Committee

2001年–2005年

2001–2005

CIDR 远见 DBMS 研究会议联合创始人(与 David J. DeWitt 和 Jim Gray 一起)

Co-founder, CIDR Conference on Visionary DBMS Research (with David J. DeWitt and Jim Gray)

2002年

2002

ACM系统软件奖委员会

ACM System Software Awards Committee

2006–2009

2006–2009

宣传

Advocacy

除了学术和工业成就之外,Stonebraker 还是一位技术传统打破者,是数据库社区事实上的领导者,并且是关系数据库的不懈倡导者。有时会引起争议,他勇于质疑各种面向数据的技术的技术方向,包括:

In addition to his academic and industrial accomplishments, Stonebraker has been a technical iconoclast, acted as the de facto leader of the database community, and served as a relentless advocate for relational databases. Sometimes controversially, he’s been unafraid to question the technical direction of a variety of data-oriented technologies, including:

• 提倡使用对象关系数据模型,而不是 20 世纪 80 年代中期许多其他研究小组和公司所追求的面向对象方法。

•  Advocated in favor of the Object-Relational data model over the object-oriented approach pursued by many other research groups and companies in the mid-1980s.

• 反对“一刀切”数据库的传统理念,支持特定领域的设计,如C-Store、H-Store 和SciDB [Stonebraker 和Çetintemel 2005]。

•  Argued against the traditional idea of the “one-size-fits-all” database in favor of domain specific designs like C-Store, H-Store, and SciDB [Stonebraker and Çetintemel 2005].

• 面对强烈反对,批评较弱的数据管理解决方案,包括 NoSQL [Stonebraker 2010a, Stonebraker 2011b]、MapReduce [DeWitt 和 Stonebraker 2008, Stonebraker 等人 2010] 和 Hadoop [Barr 和 Stonebraker 2015a]。

•  Criticized weaker data management solutions in the face of vocal opposition, including NoSQL [Stonebraker 2010a, Stonebraker 2011b], MapReduce [DeWitt and Stonebraker 2008, Stonebraker et al. 2010], and Hadoop [Barr and Stonebraker 2015a].

• 共同创办了每两年一次的创新数据系统研究会议(CIDR),以解决现有会议的缺陷,并举办了许多特殊目的研讨会。

•  Co-founded the biennial Conference on Innovative Data Systems Research (CIDR) to address shortcomings of existing conferences, and created many special-purpose workshops.

个人生活

Personal Life

斯通布雷克 (Stonebraker) 与贝丝·斯通布雷克 (Beth Stonebraker)3(娘家姓拉布 (Rabb))结婚。他们有两个已成年的女儿,莱斯利和桑德拉。他们在波士顿后湾(他骑自行车往返于家、MIT CSAIL 和他在剑桥的初创公司之间)和新罕布什尔州温尼珀索基湖都有住所;在湖区,他攀登了全部 48 座 4,000 英尺高的山峰,并驾船穿行于湖上。他们为两个社区的许多事业做出了贡献。他演奏五弦班卓琴,(如果你好好请求的话)他可能会安排他的蓝草即兴乐队“Shared Nothing”与同为音乐家和计算机科学家的约翰·“JR”·罗宾逊和斯坦·兹多尼克(后者来自布朗大学)一起表演。

Stonebraker is married to Beth Stonebraker,3 née Rabb. They have two grown daughters, Lesley and Sandra. They have homes in Boston’s Back Bay (where he bicycles between home, MIT CSAIL, and his Cambridge startups) and on Lake Winnipesaukee in New Hampshire, where he has climbed all forty-eight 4,000-foot mountains and plies the lake waters in his boat. They are contributors to many causes in both communities. He plays the five-string banjo and (if you ask nicely) he may arrange a performance of his bluegrass pickup band, “Shared Nothing,” with fellow musicians and computer scientists John “JR” Robinson and Stan Zdonik (the latter of Brown University).

致谢

Acknowledgments

作者谨此感谢 Janice L. Brown 对本章的贡献。

The author wishes to acknowledge Janice L. Brown for her contributions to this chapter.

迈克·斯通布雷克的学生谱系图

Mike Stonebraker’s Student Genealogy Chart

图像
图像

注:蓝色框中的名字没有后代。

Notes: Names in blue boxes have no descendants.

*塞缪尔·马登是迈克尔·富兰克林和约瑟夫·海勒斯坦的后裔。

*Samuel Madden is a descendant of both Michael Franklin and Joseph Hellerstein.

Harv = 哈佛大学;HKUST = 香港科技大学;IISc = 印度科学研究所;IITB = 印度理工学院,孟买;IU = 印第安纳大学;METU = 中东技术大学;MIT=麻省理工学院;NTUA = 雅典国立技术大学;NUS = 新加坡国立大学;NYU = 纽约大学;RPI = 伦斯勒理工学院;UCB = 加州大学伯克利分校;UF = 佛罗里达大学;UMD = 马里兰大学学院公园分校;UniPi = 比雷埃夫斯大学;UoC = 克里特大学;UPenn = 宾夕法尼亚大学;UW = 威斯康星大学麦迪逊分校;耶鲁=耶鲁大学。

Harv = Harvard University; HKUST = Hong Kong University of Science and Technology; IISc = Indian Institute of Science; IITB = Indian Institute of Technology, Bombay; IU = Indiana University; METU = Middle East Technical University; MIT = Massachusetts Institute of Technology; NTUA = National Technical University of Athens; NUS = National University of Singapore; NYU = New York University; RPI = Rensselaer Polytechnic Institute; UCB = University of California, Berkeley; UF = University of Florida; UMD = University of Maryland, College Park; UniPi = University of Piraeus; UoC = University of Crete; UPenn = University of Pennsylvania; UW = University of Wisconsin, Madison; Yale = Yale University.

迈克·斯通布雷克的职业生涯:图表(作者:A. Pavlo)

The Career of Mike Stonebraker: The Chart (by A. Pavlo)

图像
图像
图像

这张 2013 年摄于加州大学欧文分校贝克曼中心(NAE West)的照片,记录了 Stonebraker 学生谱系图中的五代数据库研究人员。从左起依次为:Michael Stonebraker;加州大学欧文分校的 Michael J. Carey(Stonebraker 在加州大学伯克利分校时的博士生);芝加哥大学计算机科学系系主任 Michael J. Franklin(Carey 在威斯康星大学时的博士生);MIT CSAIL 的 Samuel Madden(Franklin 在加州大学伯克利分校时的博士生);以及马里兰大学帕克分校的 Daniel Abadi(Madden 以前的博士生)。前三人计划沿着链条逐级努力,让其他人合法地将自己的名字改为“迈克”。

Five generations of database researchers from the Stonebraker Student Genealogy chart are captured in this photo, taken at UC Irvine’s Beckman Center (NAE West) in 2013. From left are Michael Stonebraker; Michael J. Carey of UC Irvine (Ph.D. student of Stonebraker when he was at UC Berkeley); Michael J. Franklin, Chairman of the University of Chicago’s Department of Computer Science (Ph.D. student of Carey when he was at University of Wisconsin); Samuel Madden of MIT CSAIL (Ph.D. student of Franklin when he was at UC Berkeley); and Daniel Abadi of the University of Maryland, College Park (former Ph.D. student of Madden). The first three plan to work their way down the chain to get the others to legally change their names to “Mike.”

1 . 资料来源:DBLP (http://dblp.uni-trier.de/pers/hd/s/Stonebraker:Michael)。上次访问时间为 2018 年 4 月 8 日。

1. Source: DBLP (http://dblp.uni-trier.de/pers/hd/s/Stonebraker:Michael). Last accessed April 8, 2018.

2 . 本节和下一节借鉴了 Grad [2007] 的优秀资源。

2. This section and the next draw on the excellent resource by Grad [2007].

3 . Stonebraker 在越野双人自行车旅行中的副驾驶在 Stonebraker [2016] 中讲述了这一点。

3. Stonebraker’s co-pilot in the cross-country tandem-bicycle trip recounted this in Stonebraker [2016].

第三部分

PART III

迈克·斯通布雷克 (Mike Stonebraker) 畅所欲言:玛丽安·温斯莱特 (Marianne Winslett) 专访

MIKE STONEBRAKER SPEAKS OUT: AN INTERVIEW WITH MARIANNE WINSLETT

2

2

迈克·斯通布雷克 (Mike Stonebraker) 畅所欲言:采访

Mike Stonebraker Speaks Out: An Interview

玛丽安·温斯莱特

Marianne Winslett

欢迎来到 ACM SIGMOD Record 对数据库社区杰出成员的系列访谈。1 我是玛丽安·温斯莱特,今天我们在美国新罕布什尔州的温尼珀索基湖。和我在一起的是迈克尔·斯通布雷克 (Michael Stonebraker),他是一位连续创业者,也是麻省理工学院的教授,此前曾在伯克利分校工作多年。Mike 获得了 2014 年图灵奖,因为他证明了数据的关系模型不仅仅是一个白日梦,而且在现实世界中是可行且有用的。迈克的博士学位来自密歇根大学。那么,迈克,欢迎!

Welcome to ACM SIGMOD Record’s series of interviews with distinguished members of the database community.1 I’m Marianne Winslett, and today we are at Lake Winnipesaukee in New Hampshire, USA. I have here with me Michael Stonebraker, who is a serial entrepreneur and a professor at MIT, and before that for many years at Berkeley. Mike won the 2014 Turing Award for showing that the relational model for data was not just a pipe dream, but feasible and useful in the real world. Mike’s Ph.D. is from the University of Michigan. So, Mike, welcome!

迈克尔·斯通布雷克:谢谢你,玛丽安。

Michael Stonebraker: Thank you, Marianne.

玛丽安·温斯莱特:三十五年前,您告诉一位朋友,赢得图灵奖将是您最自豪的时刻。虽然雄心壮志并不是你成功的唯一因素,但我认为,如此雄心勃勃从第一天起就会产生巨大的变化。

Marianne Winslett: Thirty-five years ago, you told a friend that winning the Turing Award would be your proudest moment. While ambition was hardly the only factor in your success, I think that being so ambitious would have made a huge difference from day one.

Stonebraker:我认为,如果你决定成为一名助理教授,你就必须有狂热的雄心,否则就太难了。如果你不只是有动力——没有真正动力的人就会失败。我认识的教授们都非常渴望取得成就。那些不这样做的人去做其他事情。

Stonebraker: I think that if you decide to become an assistant professor, you’ve got to be fanatically ambitious, because it’s too hard otherwise. If you’re not just driven—people who aren’t really driven fail. The professors I know are really driven to achieve. Those who aren’t go do other things.

温斯莱特:这是伯克利和麻省理工学院特有的吗?或者您对计算机科学教授的总体情况有何看法?

Winslett: Would that be specific to Berkeley and MIT? Or you think for computer science professors in general?

Stonebraker:我认为,如果你在任何一所知名大学(伊利诺伊大学也不例外),要么出版,要么灭亡,而获得终身教职的唯一方法就是真正有动力。否则就太难了。

Stonebraker: I think that if you’re at any big-name university—Illinois is no exception—that it’s publish or perish, and the only way to get tenure is to really be driven. Otherwise it’s just too hard.

温斯莱特:确实如此,但出版并不等于产生影响,而你已经产生了很大的影响。您在学生身上看到的其他性格特征(例如竞争力)是否是影响他们职业生涯的一个重要因素?

Winslett: That’s true, but publishing is not the same thing as having impact, and you’ve had a lot of impact. Are there other character traits that you see in students, like competitiveness, that have been a big factor in the impact they’ve had in their careers?

Stonebraker:我的总体感觉是你必须真正有动力。此外,我认为如果你不是在两三打知名大学中的一所,就很难真正产生影响,因为你的研究生不是那么优秀。我认为你必须有优秀的研究生,否则很难成功。

Stonebraker: My general feeling is that you have to be really driven. Furthermore, I think if you’re not at one of two or three dozen big-name universities, it’s hard to really have impact because the graduate students you have aren’t that good. I think you’ve got to have good graduate students or it’s very difficult to succeed.

任何为我工作的人都必须学习如何编码,尽管我不擅长编码,因为我让每个人都实际做事,而不仅仅是写理论。在我们的领域,仅靠纸笔做事很难产生影响。

Anyone who works for me has to learn how to code, even though I’m horrible at coding, because I make everybody actually do stuff rather than just write theory. In our field, it’s really hard to have an impact just doing paper-and-pencil stuff.

温斯莱特:您以给教授的建议的形式表达了您的建议。对于工业界人士或工业研究实验室的人来说会有不同吗?

Winslett: You couched your advice in terms of advice for professors. Would it be different for people in industry or at a research lab in industry?

Stonebraker:也有一些例外,但我认为总的来说,影响最大的人是大学里的人。

Stonebraker: There are some exceptions, but I think by and large, the people who’ve made the biggest impact have been at universities.

工业研究实验室有两个主要问题。第一个是,构建原型的最佳方法是由一位酋长带领一些印第安人来做,而工业研究实验室通常不存在这种方式。我认为 System R 能够将近十二位负责人聚集在一起并让一些东西发挥作用,这真是一个奇迹。

Industrial research labs have two main problems. The first one is that the best way to build prototypes is with a chief and some Indians, and that generally doesn’t exist at industrial research labs. I think it’s a marvel that System R managed to put together nearly a dozen chiefs and get something to work.

问题二是,如果你在一所知名大学,如果你不带钱,你就什么也做不了。你必须具有创业精神,你必须成为一名推销员,你必须筹集资金,而这些都是你在工业研究实验室不需要具备的特征。真正有进取心的人会自我选择进入大学。

Problem two is that if you’re at a big-name university, if you don’t bring in money you can’t get anything done. You have to be entrepreneurial, you’ve got to be a salesman, you’ve got to raise money, and those are characteristics you don’t have to have at an industrial research lab. The really aggressive people self-select themselves into universities.

Winslett:作为我的其他工作之一( ACM Transactions on the Web的联合主编),我会浏览与网络有任何关系的主要会议,并查看最佳论文奖获得者。令人惊讶的是,现在其中有多少人来自工业界。

Winslett: As one of my other jobs (co-editor-in-chief of ACM Transactions on the Web), I go through the major conferences that are related in any way to the Web and look at the best paper prize winners. It is amazing how many of those now come from industry.

Stonebraker:基本上所有对网络研究的贡献都涉及大数据,互联网公司拥有所有数据,但他们不与学术界分享。我认为如果不在网络公司工作,就很难在网络研究方面做出重大贡献。在硬件、网络中,肯定有一些领域很难在学术界做出贡献。

Stonebraker: Essentially all the contributions to Web research involve big data, and the Internet companies have all the data and they don’t share it with academia. I think it’s very difficult to make significant contributions in Web research without being at a Web company. In hardware, in the Web—there are definitely areas where it’s hard to make a contribution in academia.

温斯莱特:但在数据库的基础设施方面,您认为作为学术研究人员仍然有可能产生强大的影响。您不必去数据所在的地方。

Winslett: But in the infrastructure side of databases, you think it’s still possible to have strong impact as an academic researcher. You don’t have to go where the data is.

斯通布雷克:对。我觉得真正有趣的是,如果你看看数据库公司的贡献,你会发现它们的数量很少而且相距甚远。好的想法仍然主要来自大学。然而,地平线上有乌云密布。无论出于何种原因,公司都乐意让供应商查看他们的数据,但他们不想与其他任何人共享。我最喜欢的例子是我正在寻找有关数据库崩溃的数据——为什么数据库系统会崩溃?

Stonebraker: Right. The thing I find really interesting is that if you look at the contributions that have come from the database companies, they’re few and far between. Good ideas still come primarily from universities. However, there are storm clouds on the horizon. For whatever reason, companies are happy to let the vendor look at their data, but they don’t want to share it with anybody else. My favorite example is that I was looking for data on database crashes—why do database systems crash?

我曾遇到一条非常大的“鲸鱼”客户,愿意分享他们的数据库崩溃日志。但这件事最终泡汤了,因为第一,这家公司不想让人们知道他们的正常运行时间有多低;第二,他们的供应商不想让人们知道他们崩溃的频率。我认为运营数据的问题在于它往往会让某些人脸上无光,这让事情变得困难。

I had a very large whale who was willing to share their logs of database crashes. That went down the tubes because number one, the company didn’t want people to know how low their uptime was, and number two, their vendor didn’t want people to know how often they crashed. I think the trouble with operational data is that it tends to put egg on somebody’s face and that makes it difficult.

温斯莱特:我完全同意,那么你是如何在学术方面产生影响的呢?

Winslett: I agree completely, so how do you still manage to have an impact coming from the academic side?

Stonebraker:我认为产生影响的最简单方法就是做一些有趣的事情,然后获得风险投资的支持,将其变成现实。Ingres 实际上确实有效。Postgres 确实有效,但从那时起我构建的每个系统都只是勉强可用,因为它太困难了。你得到一个勉强可用的原型,然后你获得风险投资资金将其变成现实,然后你就可以在市场上与大象竞争。

Stonebraker: I think that the easiest way to have an impact is to do something interesting and then get venture capital backing to turn it into something real. Ingres actually really worked. Postgres really worked, but every system I’ve built since then just barely limped along because it got too difficult. You get a prototype that just barely works and then you get VC money to make it real and then you go and compete in the marketplace against the elephants.

克莱顿·克里斯滕森(Clayton Christensen)有一本精彩的书,名为《创新者的困境》。基本上,它表明,如果您销售旧技术,则很难在不失去客户群的情况下转变销售新技术。这使得大型数据库公司对新想法不太感兴趣,因为新想法会蚕食他们现有的基础。如果你想有所作为,你要么尝试让数据库公司对你正在做的事情感兴趣,要么你自己创办一家初创公司。如果你不做其中之一,那么我认为你的影响是有限的。我认识的每个人都有兴趣创办一家公司来做出改变。要将您的想法真正付诸实践,这是唯一的方法。

There’s a fabulous book by Clayton Christensen called The Innovator’s Dilemma. Basically, it says that if you’re selling the old technology, it’s very difficult to morph selling the new technology without losing your customer base. This makes the large database companies not very interested in new ideas, because new ideas would cannibalize their existing base. If you want to make a difference, you either try to interest a database company in what you’re doing, or you do a startup. If you don’t do one or the other, then I think your impact is limited. Everyone I know is interested in starting a company to make a difference. To get your ideas really into the world, that’s the only way to do it.

温斯莱特:图灵奖已经在手,您还想在职业生涯中实现什么目标?

Winslett: With the Turing Award already in hand, what else would you like to accomplish in your career?

Stonebraker:此时我已经 73 岁了。我不知道有哪位 80 岁的研究人员仍然保持活力,所以我的目标很简单:尽可能长时间地保持活力,并希望在自己跟不上的时候能够意识到,然后优雅地退居幕后。我只是想保持竞争力。

Stonebraker: At this point I’m 73 years old. I don’t know of any 80-year-old researchers who are still viable, and so my objective is very simple, to stay viable as long as I can, and to hopefully realize when I’ve fallen off the wagon and gracefully retire to the sidelines. I’m just interested in staying competitive.

温斯莱特:关于保持竞争力:你的一位同事说,“迈克因只喜欢自己的想法而臭名昭著,这当然是有道理的,因为他常常是对的。” 告诉我一次你在重大技术问题上改变主意的经历。

Winslett: Regarding staying competitive: One of your colleagues says that “Mike is notorious for only liking his own ideas, which is certainly justifiable because he is often right.” Tell me about a time you changed your mind on a major technical matter.

Stonebraker:我认为我最大的失败是我在 20 世纪 70 年代、80 年代甚至 90 年代是分布式数据库的超级粉丝,但这些东西没有商业市场。相反,并行数据库系统有一个巨大的市场,它是具有不同架构的分布式数据库系统,但我没有意识到这就是大市场所在。我完全错过了。我本可以写《Gamma》,但我没有。这是我完全错过的一个主要主题,我花了很长时间才意识到,出于各种充分的理由,分布式数据库系统确实没有市场。归根结底,现实世界是最终的陪审团,我慢慢地意识到没有市场。

Stonebraker: I think my biggest failure was that I was a huge fan of distributed databases in the 1970s and the ’80s and even in the ’90s, and there’s no commercial market for that stuff. Instead there’s a gigantic market for parallel database systems, which are distributed database systems with a different architecture, and I didn’t realize that that was where the big market was. I just missed that completely. I could’ve written Gamma, but I didn’t. That was a major theme that I missed completely, and it took me a very long time to realize that there really is no market for distributed database systems for all kinds of good reasons. At the end of the day, the real world is the ultimate jury and I was slow to realize that there was no market.

Winslett: You spent decades pooh-poohing specialized data management tools such as object databases and vertical stores. Then in the 2000s, you started arguing that one size does not fit all. Why did you change your mind?

Stonebraker: In the 1980s, there was only one market for databases. It was business data processing, and for that market the relational model seems to work very well. After that, what happened was that all of a sudden there were scientific databases. All of a sudden there were Web logs. These days, everyone on the planet needs a database system. I think the market has broadened incredibly since the ’80s and in the non-business-data-processing piece of the market, sometimes relational databases are a good idea and sometimes they’re not. That realization was market-driven. I changed my mind based on the market being very different.

Winslett: Was there a particular moment when that happened? Something you saw that made you think, “We have to diversify?”

Stonebraker: Let me generalize that question a bit: Where do good ideas come from? I have no clue. They just seem to happen. I think the way to make them happen best is to hang around smart people, talk to lots and lots of people, listen to what they say. Then slowly something sinks in and then something happens.

For example, a couple of years before we wrote H-Store, I had talked to a VC who said, “Why don’t you propose a main memory database system for OLTP?” And I said, “Because I don’t have a good idea for how to do it.” But that generated the seed, the realization that somebody was interested in that topic. Eventually the ideas came, and we built H-Store.

I don’t know when this happens or how it happens. I live in terror of not having any more good ideas.

Winslett: How many different forms of data platform would be too many?

Stonebraker: There will certainly be main memory OLTP systems that are mostly going to be row stores, and there will certainly be column stores for the data warehouse market.

My suspicion is that the vast majority of scientific databases are array-oriented and they’re doing complex codes on them. My suspicion is that relational database systems are not going to work out very well there and that it would be something else, maybe an array store, who knows? In complex analytics, singular value decomposition, linear regression, all that stuff, which is the operations those kinds of folks want to do on largely array-oriented data, the jury is out as to how that’s going to be supported.
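The whole-array computations Stonebraker mentions can be made concrete with a toy sketch. The snippet below approximates the largest singular value of a small dense matrix by power iteration on AᵀA, which is the numeric kernel inside SVD-style analytics. This is purely illustrative: a real array store would run these steps as vectorized whole-array operators, not Python loops, and this is not an operation a relational engine has a natural operator for.

```python
def largest_singular_value(matrix, steps=200):
    """Approximate the largest singular value of a small dense matrix
    by power iteration on A^T A -- the kernel inside SVD/PCA-style
    analytics on array-oriented data."""
    rows, cols = len(matrix), len(matrix[0])
    v = [1.0] * cols  # arbitrary starting vector
    for _ in range(steps):
        # av = A v
        av = [sum(matrix[i][j] * v[j] for j in range(cols))
              for i in range(rows)]
        # w = A^T (A v)
        w = [sum(matrix[i][j] * av[i] for i in range(rows))
             for j in range(cols)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]  # renormalize
    av = [sum(matrix[i][j] * v[j] for j in range(cols)) for i in range(rows)]
    return sum(x * x for x in av) ** 0.5  # ||A v|| approximates sigma_max
```

For a diagonal matrix with entries 3 and 2, the iteration converges to the dominant singular value 3.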

I’m not a huge fan of graph-based database systems, because it’s not clear to me that a graph-based system is any faster than simulating a graph either on a tabular system or an array system. I think we’ll see whether graph-based systems make it. XML is yesterday’s big idea and I don’t see that going anywhere, so I don’t see doing an XML store as a worthwhile thing to try.
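The claim that a graph can be simulated on a tabular system is easy to illustrate. In this sketch (all node names invented), the graph is kept as a plain two-column edge table, the index is what a relational engine would build for a hash join on the source column, and a two-hop traversal is just two self-joins:

```python
from collections import defaultdict

def build_edge_index(edges):
    """Index a (src, dst) edge table by source node, roughly what a
    hash join on the src column would build."""
    index = defaultdict(list)
    for src, dst in edges:
        index[src].append(dst)
    return index

def two_hop_neighbors(index, start):
    """A 2-hop traversal expressed as two self-joins on the edge table,
    excluding the start node and its direct neighbors."""
    hop1 = index[start]
    hop2 = {d for mid in hop1 for d in index[mid]}
    return hop2 - {start} - set(hop1)

# A tiny invented graph as an edge table.
edges = [("a", "b"), ("a", "c"), ("b", "d"), ("c", "d"), ("d", "e")]
idx = build_edge_index(edges)
```

Whether a dedicated graph engine beats this kind of simulation at scale is exactly the open question Stonebraker raises.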

Winslett: What about specialized stores for log data?

Stonebraker: It seems to me that most of the log processing works fine with data warehouses. But there’s no question that stream processing associated with the front end of the log will either be specialized stream processing engines like Kafka or main memory database systems like VoltDB. The jury is out as to whether there’s going to be a special category called streaming databases that’s different from OLTP.

We might need half a dozen specialized data stores, but I don’t think we need 20. I don’t even think we need ten.

Winslett: You say that there’s no query Esperanto. If so, why have you been working on polystores and BigDAWG?

Stonebraker: Polystores to me mean support for multiple query languages, and BigDAWG has multiple query languages, because I don’t think there is a query language Esperanto. Among the various problems with distributed databases: First, there isn’t a query language Esperanto. Second, the schemas are never the same on independently constructed data. Third, the data is always dirty, and everybody assumes that it’s clean. You’ve got to have much more flexible polystores that support multiple query languages and integrate data cleaning tools and can deal with the fact that schemas are never the same.

Winslett: That’s a good lead-in to talking about your project Data Civilizer, which aims to automate the grunt work of finding, preparing, integrating, and cleaning data. How well can we solve this problem?

Stonebraker: The Data Civilizer project comes from an observation made by lots of people who talk to a data scientist who is out in the wild doing data science. No one claims to spend less than 80% of their time on the data munging that has to come in advance of any analytics. A data scientist spends at most one day a week doing the job for which she was hired, and the other four days doing grunt work. Mark Schreiber, who’s the chief data scientist for Merck, claims it’s 98% time in grunt work, not 80%! So, the overwhelming majority of your time, if you’re a data scientist, is spent doing mung work. In my opinion, if you worry about data analytics, you’re worrying about the spare-change piece of the problem.

If you want to make a difference, you have to worry about automating the mung work. That’s the purpose of Data Civilizer. Mark Schreiber’s using the system that we have, and he likes what he sees, so at least we can make some difference. How much we can cut down this 80 to 90% remains to be seen. As a research community, we worked on data integration 20 years ago and then it got kind of a bad name, but the problems in the wild are still there and if anything, they’re much, much worse. I’d encourage anybody who wants to make a difference to go work in that area.

Winslett: You said that your Merck guy likes what he sees. Can you quantify that?

Stonebraker: At the top level, Merck has about 4,000 Oracle databases. They don’t actually know how many they’ve got. That’s in addition to their data lake, on top of uncountable files, on top of everything imaginable. For a starter, if you were to say, “I’m interested in finding a dataset that can be used to figure out whether Ritalin causes weight gain in mice,” your first problem is to identify a dataset or datasets that actually might have the data you’re interested in. So, there’s a discovery problem.

Merck is running the discovery component of Data Civilizer, which lets you ask questions like, “I’m interested in Ritalin, tell me about some datasets that contain Ritalin.” They’re using that and they like what they see.

Beyond discovery, data cleaning is a huge problem. We’re working on that using Merck and others as a test case. This ties back to what we said earlier: To make a difference in data integration and data cleaning, you’ve got to find a real-world problem, find an enterprise that actually wants your problem solved.

For instance, in doing data integration, the overwhelming majority of the products I’ve seen have Table 1 over here, Table 2 over here, you draw some lines to hook stuff up. That doesn’t help anybody that I know of. For example, in the commercialization of the original Data Tamer system, GlaxoSmithKline is a customer. They’ve got 100,000 tables and they want to do data integration at that scale, and anything that manually draws lines is a non-starter.

As a research community, it absolutely behooves us to do the shoe leather, to go out and talk to people in the wild and figure out exactly what their data problems are and then solve them, as opposed to solving problems that we make up.

Winslett: Definitely, but data integration has been on that top ten list of problems of those Laguna Beach-type reports …

Stonebraker: Forever.

Winslett: Always.

Stonebraker: Yes.

Winslett: How have things changed that we can finally get some traction?

Stonebraker: Let me give you a quick example. I assume you know what a procurement system is?

Winslett: Sure.

Stonebraker: How many procurement systems do you think General Electric has?

Winslett: Maybe 500?

Stonebraker: They have 75, which is bad enough. Let’s suppose you’re one of these 75 procurement officers and your contract with Staples comes up for renewal. If you can figure out the terms and conditions negotiated by your other 74 counterparts and then just demand most-favored-nation status, you’ll save General Electric something like $500 million a year.

Winslett: Sure, but that was already true 25 years ago, right?

Stonebraker: Yeah, but enterprises are in more pain now than they were back then, and modern machine learning can help.

The desire of corporations to integrate their silos is going up and up and up, either because they want to save money, or they want to do customer integration. There’s a bunch of things that companies want to do that all amount to data integration. If you realize that there’s $500 million on the table, then it leads you to not be very cautious, to try wild and crazy ideas. The thing about Tamr that just blows me away is that GE was willing to run what I would call a pre-alpha product just because they were in so much pain. Generally, no one will even run version 1.0 of a database system. If you’re in enough pain, then you’ll try new ideas.

In terms of data integration, it’s very, very simple. You apply machine learning and statistics to do stuff automatically because anything done manually, like drawing lines, is just not going to work. It’s not going to scale, which is where the problem is. Data integration people hadn’t applied machine learning, but it works like a charm … well, it works well enough that the return on investment is good!
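The "apply machine learning instead of drawing lines" point can be sketched in miniature. The toy below scores record pairs with a token-overlap similarity and learns a match threshold from a handful of labeled pairs; it is a stand-in for what a Data Tamer-style system does, which in reality trains far richer classifiers over many similarity features. All records and labels here are invented.

```python
def jaccard(a, b):
    """Token-set similarity between two strings, in [0, 1]."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 1.0

def learn_threshold(labeled_pairs):
    """Pick the similarity cutoff that best separates labeled
    match/non-match pairs -- a toy stand-in for a trained classifier."""
    scores = [(jaccard(a, b), is_match) for a, b, is_match in labeled_pairs]
    best_t, best_acc = 0.0, -1.0
    for t in (i / 20 for i in range(21)):
        acc = sum((s >= t) == y for s, y in scores) / len(scores)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t

def matches(a, b, threshold):
    return jaccard(a, b) >= threshold

# Invented training pairs: (record 1, record 2, same entity?)
labeled = [
    ("general electric co", "General Electric Company", True),
    ("staples inc", "Staples Inc", True),
    ("merck", "pfizer", False),
    ("ibm corp", "oracle corp", False),
]
t = learn_threshold(labeled)
```

The point of the sketch is the shape of the approach, not its accuracy: the machine picks the rule, so it scales to 100,000 tables where hand-drawn lines cannot.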

Winslett: You’ve said many harsh words about AI in the past. Was there a moment when AI finally turned a corner and became useful?

Stonebraker: I think machine learning is very useful, and it’s going to have a gigantic impact. Whether it’s conventional or deep learning, the stuff works. I’m much less interested in other kinds of AI. Google pioneered deep learning in a way that actually works for image analysis, and I think it works for natural language processing too. There’s a bunch of areas where it really does work. Conventional machine learning, based on naive Bayes models, decision trees, whatever, also works well enough in a large number of fields to be worth doing.

The standard startup idea, at least from three or four years ago, was to pick some area, say choosing pricing for hotel rooms. One startup said, “Okay, that’s what I want to do. I’ll try it in the Las Vegas market first,” and they got all the data they could find on anything that might relate to hotel rooms. They ran an ML model and they found out that you should set hotel prices based on arrivals at McCarran Airport, which sounds like a perfectly reasonable thing to do. If you apply this kind of technology to whatever your prediction problem is, chances are some version of ML is going to work, unless of course there’s no pattern at all.
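The hotel-pricing anecdote boils down to fitting one predictor against one target. Here is a minimal ordinary-least-squares sketch with invented numbers; a real model would use many features and something richer than a straight line, but the shape of what the startup did is the same.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a*x + b."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return a, my - a * mx

# Invented (airport arrivals, average room price) observations.
data = [(20_000, 90.0), (30_000, 110.0), (40_000, 130.0), (50_000, 150.0)]
a, b = fit_line([x for x, _ in data], [y for _, y in data])

def predict_price(arrivals):
    """Predicted room price for a given arrivals count."""
    return a * arrivals + b
```

With these invented points the fit is exact, so 35,000 arrivals predicts a price of 120.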

But in lots of cases there is a pattern; it’s just fairly complicated and not obvious to you and me. ML will probably find it. I think applications of ML are going to have a big impact.

Winslett: From your perspective as a database researcher, what are the smartest and dumbest things you’ve seen a hardware vendor do in the last few years?

Stonebraker: A million years ago, Informix was losing the relational database wars to Oracle. A succession of CEOs thought the solution to the problem was to buy some startup, so they bought Illustra, which was the company I worked for. After that they bought a company called Red Brick Systems, and after that they bought a company whose name I can’t remember who built Java database systems. They thought that the salvation was going to be to buy somebody.

I think that’s almost always a dumb idea, because in all these cases, the company really didn’t have a plan for how to integrate what they were buying, how to train their sales force on how to sell it, how to sell it in conjunction with stuff they already had. When the rubber meets the road, I read somewhere that three-quarters of the acquisitions that companies make fail. So, my recommendation is to be a lot more careful about what you decide to acquire, because lots of times it doesn’t work out very well. Getting value from an acquisition means integrating the sales force, integrating the product, etc., etc., and lots and lots of companies screw that up.

When HP bought Vertica, the biggest problem was that HP really couldn’t integrate the Vertica sales force with the HP sales force, because the HP sales force knew how to sell iron and iron guys couldn’t sell Vertica. It was a totally different skill set.

Winslett: Are there advances in hardware in recent years that you think have been really good for the database world?

Stonebraker: I think GPUs for sure will be interesting for a small subset of database problems. If you want to do a sequential scan, GPUs do great. If you want to do singular value decomposition, that’s all floating-point calculations, and GPUs are blindingly fast at floating point calculations. The big caveat, though, is that your dataset has to fit into GPU memory, because otherwise you’re going to be network-bound on loading it. That will be a niche market.

I think non-volatile RAM is definitely coming. I’m not a big fan of how much impact it’s going to have, because it’s not fast enough to replace main memory and it’s not cheap enough to replace solid-state storage or disk. It will be an extra level in the memory hierarchy that folks may or may not choose to make use of. I think it’s not going to be a huge game changer.

I think RDMA and InfiniBand will be a huge, huge, huge deal. Let me put it generally: Networking is getting faster at a greater rate than CPUs and memory are getting beefier. We all implemented distributed systems such as Vertica with the assumption that we were network-bound, and that’s not true anymore. That’s going to cause a fair amount of rethinking of most distributed systems. Partitioning databases either makes no sense anymore or it makes only limited sense. Similarly, if you’re running InfiniBand and RDMA, then Tim Kraska demonstrated that new kinds of concurrency control systems are perhaps superior to what we’re currently doing, and that is going to impact main memory database systems.

The networking advances make a big difference, but I think on top of this, James Hamilton, who is one super smart guy, currently estimates that Amazon can stand up a server node at 25% of your own cost to do it. Sooner or later that’s going to cause absolutely everybody to use cloud-based systems, whether you’re letting Amazon run your dedicated hardware or you’re using shared hardware or whatever. We’re all going to move to the cloud. That’s going to be the end of raised floor computer rooms at all universities and most enterprises. I think that’s going to have an unbelievable impact and sort of brings us back to the days of time-sharing. What goes around comes around.

I think that that in turn is going to make it difficult to do computer architecture research, because if there are half a dozen gigantic cloud vendors running 10 million nodes, and the rest of us have a few nodes here and there, then you pretty much have to work for one of the giants to get the data to make a difference.

Winslett: What do you wish database theory people would work on now?

Stonebraker: Here’s something that I would love somebody to work on. We professors write all the textbooks, and on the topic of database design, all the textbooks say to build an entity-relationship model, and when you’re happy with it, push a button and it gets converted to third normal form. Then code against that third-normal-form set of tables, and that’s the universal wisdom. It turns out that in the real world, nobody uses that stuff. Nobody. Or if they use it, they use it for the green-field initial design and then they stop using it. As near as I can tell, the reason is that the initial schema design, schema #1, is just the first in an evolution of schemas as business conditions change.

When you move from schema #1 to schema #2, the goal is never to keep the database as clean as possible. Our theory says, “Redo your ER model, get a new set of tables, push the button.” That will keep the schema endlessly in third normal form, a good state. No one uses that because their goal is to minimize application maintenance, and so they let the database schema get as dirty as required in order to keep down the amount of application maintenance.
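The trade-off described above — dirtying the schema to avoid touching application code — can be shown with a toy `sqlite3` example (table and column names invented). The textbook move from schema #1 to schema #2 would re-normalize and force every application query to be rewritten; what shops actually do is bolt on a nullable column so the old queries keep running untouched:

```python
import sqlite3

# Schema #1: orders with an inline supplier name.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY,"
           " supplier TEXT, total REAL)")
db.execute("INSERT INTO orders VALUES (1, 'Staples', 420.0)")

# The textbook schema #2 would split supplier into its own table and
# rewrite every query. The real-world evolution is cheaper and dirtier:
# add a nullable column, leave the denormalized table alone.
db.execute("ALTER TABLE orders ADD COLUMN supplier_region TEXT")
db.execute("UPDATE orders SET supplier_region = 'US-East' WHERE id = 1")

# The pre-existing application query still works, unmodified.
rows = db.execute("SELECT supplier, total FROM orders").fetchall()
```

Nothing here is in third normal form after the change — which is exactly the point: application maintenance, not schema cleanliness, drives the evolution.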

It would be nice if the theory guys could come up with some theory of database application coevolution. That’s clearly what the real world does. My request to the theory guys is that they find a real-world problem that somebody’s interested in that your toolkit can be used to address. Please don’t make up artificial problems and then solve them.

Winslett: That’s good advice for any researcher.

What lessons can the database community learn from MapReduce’s success in getting a lot of new people excited about big data?

Stonebraker: I view MapReduce as a complete and unmitigated disaster. Let me be precise. I’m talking about MapReduce, the Google thing where there’s a map and a reduce operation that was rewritten by Yahoo and called Hadoop. That’s a particular user interface with a map operation and a reduce operation. That’s completely worthless. The trouble with it is that no one’s problem is simple enough that those two operations will work.
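For readers who have not seen it, the user interface in question really is just this pair of operations. A single-process sketch of the canonical word-count example shows the whole contract: a map step emitting (key, value) pairs, a shuffle grouping by key, and a reduce step per key.

```python
from itertools import groupby

def map_reduce(records, mapper, reducer):
    """A toy, single-process version of the MapReduce interface:
    map -> shuffle (sort/group by key) -> reduce."""
    pairs = [kv for rec in records for kv in mapper(rec)]
    pairs.sort(key=lambda kv: kv[0])  # the "shuffle"
    return {key: reducer(key, [v for _, v in group])
            for key, group in groupby(pairs, key=lambda kv: kv[0])}

# Canonical word count over two invented lines of text.
counts = map_reduce(
    ["the quick fox", "the lazy dog"],
    mapper=lambda line: [(word, 1) for word in line.split()],
    reducer=lambda word, ones: sum(ones),
)
```

Stonebraker's complaint is precisely that real workloads do not decompose into this one map and one reduce, however cleanly the toy example fits.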

If you’re Cloudera, you’ve now got a big problem because you’ve been peddling MapReduce and there’s no market for it. Absolutely no market. As a result, Cloudera very carefully applied some marketing and said, “Hadoop doesn’t mean Hadoop anymore. It means a three-level stack with HDFS at the bottom, MapReduce in the middle, and SQL at the top. That’s what Hadoop means now: it’s a stack.” However, you still have a problem because there is no market for the MapReduce piece of the Cloudera stack. So, the next Cloudera action was to deprecate the MapReduce piece by implementing a relational SQL engine, Impala, which drops out MapReduce completely and does SQL on top of HDFS. In effect, Cloudera realized that 75% of the “Hadoop market” is SQL and that MapReduce is irrelevant.

In an SQL implementation, there is no place for a MapReduce interface. None of the data warehouse products use anything like that, and Cloudera Impala looks exactly like the other data warehouse guys’ products. In my opinion, the “Hadoop market” is actually a SQL data warehouse market. May the cloud guys and the Hadoop guys and the traditional database vendors duke it out for who’s got the best implementation.

Winslett: But didn’t MapReduce get a lot of potential users excited about what they might be able to do with their data?

Stonebraker: Yes.

Winslett: It’s a gateway drug, but you still don’t approve of it.

Stonebraker: Lots of companies drank the MapReduce Kool-Aid, went out and spent a lot of money buying 40-node Hadoop clusters, and they’re now trying to figure out what the heck to do with them. Some poor schmuck has to figure out what in the world to do with an HDFS file system running on a 40-node cluster, because nobody wants MapReduce.

Never to be denied a good marketing opportunity, the Hadoop vendors said, “Data lakes are important.” A data lake is nothing but a junk drawer where you throw all of your data into a common place, and that ought to be a good thing to do. The trouble with data lakes is that if you think they solve your data integration problem, you’re sadly mistaken. They address only a very small piece of it. I’m not opposed to data lakes at all, if you realize that they are just one piece of your toolkit to do data integration. If you think that all data integration needs are a MapReduce system, you’re sadly mistaken.

If you think that the data lake is your data warehouse solution, the problem is that right now, the actual truth that Cloudera doesn’t broadcast is that Impala doesn’t really run on top of HDFS. The last thing on the planet you want in a data warehouse system is a storage engine like HDFS that does triple-redundancy but without transactions, and that puts your data all over everywhere so that you have no idea where it is. Impala actually drills through HDFS to read and write the underlying Linux files, which is exactly what all the warehouse products do.

In effect, the big data market is mostly a data warehouse market, and may the best vendor win. We talked about ML earlier, and I think that complex analytics are going to replace business intelligence. Hopefully that will turn this whole discussion into how to support ML at scale, and whether database systems have a big place in that solution. Exactly what that solution is going to be, I think, is a very interesting question.

Winslett: What do you think of the database technology coming out of Google, like Cloud Spanner?

Stonebraker: Let’s start way back when. The first thing Google said was that MapReduce was a purpose-built system to support their Web crawl for their search engine, and that MapReduce was the best thing since sliced bread. About five years went by and all the rest of us said, “Google said MapReduce is terrific, so it must be good, because Google said so,” and we all jumped on the MapReduce bandwagon. At about the same time Google was getting rid of MapReduce for the application for which it was purpose-built, namely, Web search. MapReduce is completely useless, and so Google has done a succession of stuff. There’s BigTable, BigQuery, there’s Dremel, there’s Spanner … I think, personally, Spanner is a little misguided.

For a long time, Google was saying eventual consistency is the right thing to do. All their initial systems were eventual consistency. They figured out maybe in 2014 what the database folks had been saying forever, which is that eventual consistency actually creates garbage. Do you want me to explain why?

Winslett: No.

Stonebraker: Okay. Essentially everybody has gotten rid of eventual consistency because it gives no consistency guarantee at all. Eventual consistency was another piece of misdirection from Google that just was a bad idea. These were bad ideas because Google didn’t have any database expertise in house. They put random people on projects to build stuff and they built whatever they wanted to without really learning the lessons that the database folks had learned over many, many years.

Google takes the point of view in Spanner of, “We’re not going to do eventual consistency. We’re going to do transactional consistency, and we’re going to do it over wide area networks.” If you control the end-to-end network, meaning you own the routers, you own the wires, you own everything in between here and there, then I think Spanner very, very cleverly figured out that you could knock down the latency to where a distributed commit worked over a wide area network.

The problem is that you and I don’t control the end-to-end network. We have no way to knock the latency down to what Google can do. I think the minute you’re not running on dedicated end-to-end iron, the Spanner ideas don’t knock the latency down enough to where real-world people are willing to use it.

I will be thrilled when distributed transactions over the wide area networks that you and I can buy will be fast enough that we’re willing to run them. I think that will be great. In a sense, Spanner leads the way on totally dedicated iron.

Winslett: What low-hanging fruit is there for machine learning in solving database problems?

Stonebraker: We’ve been building a database for supporting autonomous vehicles. Right now, AV folks want to keep track of whether there’s a pedestrian in a particular image, whether there’s a bicycle in a particular image. So far, they want to keep track of half a dozen things, but the number of things you might want to keep track of is at least 500. Stop signs, free parking spaces, emergency vehicles, unsafe lane changes, sharp left-hand turns … Assume there are 500 things you might want to index and then figure out which ones to actually index. For instance, you might want to index cornfields. In Urbana, that’s probably a really good idea.

Winslett: Because the corn might get up and walk in front of the car?

Stonebraker: Well, because …

Winslett: I’d rather see deer indexed. And kangaroos, because they seem to have a death wish too.

Stonebraker: That’d be fine. I’m just saying there’s a lot of things that might be worth indexing, and they’re very situational. Cornfields are a good example because there are lots of them in Illinois, but there aren’t hardly any inside Route 128 in Massachusetts. You’ve got to figure out what’s actually worth indexing. You can probably apply machine learning to watch the queries that people do and start by indexing everything and then realize that some things are just so rarely relevant that it isn’t worth continuing to index them. Applying ML to do that, rather than have it be a manual thing, probably makes a lot of sense.
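The idea above can be sketched in a few lines. This is my own toy illustration, not the actual AV database, and it assumes a query log tagged with the label each query touches: start by indexing everything, watch the log, and drop indexes for labels that are rarely queried.

```python
# Toy sketch (my own illustration): prune indexes for rarely-queried labels.
from collections import Counter

def prune_indexes(candidate_labels, query_log, min_hits=5):
    """Keep an index only for labels queried at least `min_hits` times."""
    hits = Counter(query_log)  # label -> number of queries touching it
    return {label for label in candidate_labels if hits[label] >= min_hits}

labels = {"pedestrian", "bicycle", "stop_sign", "cornfield"}
# Hypothetical query log: cornfields are almost never asked about in Boston.
log = ["pedestrian"] * 40 + ["bicycle"] * 12 + ["stop_sign"] * 9 + ["cornfield"]
# Keeps pedestrian, bicycle, and stop_sign; drops cornfield.
print(prune_indexes(labels, log, min_hits=5))
```

A real system would weight this by query cost and index maintenance cost, but the frequency cutoff captures the "so rarely relevant it isn't worth indexing" test in the answer.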

Winslett: Do you see a broader role for ML in query optimization, or has it just become a kind of black art?

Stonebraker: It’s certainly worth a try. It’s perfectly reasonable to run a plan, record how well it did, choose a different plan next time, build up a plan database with running times and see if you can run ML on that to do better. I think it’s an interesting thing to try.
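One minimal way to sketch that feedback loop, as my own illustration rather than any system described in the interview, is an epsilon-greedy chooser over recorded plan runtimes: exploit the historically fastest plan, but occasionally re-try others in case the data has shifted.

```python
# Toy sketch (illustrative only): record each plan's running time, then prefer
# the historically fastest plan while occasionally exploring alternatives.
import random
from collections import defaultdict

class PlanChooser:
    def __init__(self, plans, epsilon=0.1):
        self.plans = list(plans)
        self.epsilon = epsilon          # fraction of time we explore
        self.times = defaultdict(list)  # plan -> observed running times

    def choose(self):
        untried = [p for p in self.plans if not self.times[p]]
        if untried:
            return untried[0]           # try every plan at least once
        if random.random() < self.epsilon:
            return random.choice(self.plans)
        # exploit: plan with the best average recorded time
        return min(self.plans,
                   key=lambda p: sum(self.times[p]) / len(self.times[p]))

    def record(self, plan, elapsed):
        self.times[plan].append(elapsed)

chooser = PlanChooser(["hash_join", "merge_join", "nested_loop"])
for plan, t in [("hash_join", 0.9), ("merge_join", 2.0), ("nested_loop", 7.5)]:
    chooser.record(plan, t)
# With history recorded, the chooser mostly picks the fastest plan so far.
```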

Winslett: You’ve been stunningly successful in pulling together a bunch of universities to work on an integrated project. Was this your solution to the lone gunslinger mentality that you found when you moved to MIT?

Stonebraker: I had no choice. I was alone. There were no faculty, no students, no courses, no nothing. The only strategy was to reach out to the other universities in the Boston area. That strategy wouldn’t work very well in Urbana because there aren’t enough close-by universities, but in major metropolitan areas it’s different. In Boston, there are six or eight universities, each with one or two database people. In aggregate, you can be a very, very strong distributed group.

Winslett: It’s very hard to make a distributed collaboration work, but you made it work. It seems like physical proximity still played a role. How often did you get together?

Stonebraker: I drove to Brown once a week. In effect, we held real group meetings once a week and people drove to them. That only works if you have geographic proximity.

Winslett: Other key ingredients in making the distributed collaboration work?

Stonebraker: I think I had the great advantage that people were willing to listen to me and pretty much do what I suggested. The general problem is that there’s a cacophony of ideas with no way to converge. There’s got to be some way to converge, and either that takes a lead gunslinger, or it takes a program monitor from DARPA who’s willing to knock heads. There’s got to be some way to converge people, and I’ve managed to do that, pretty much.

Winslett: Any other ingredients worth mentioning, beyond those two?

Stonebraker: I also have a big advantage that I don’t need any more publications, and so I’m happy to write papers that other people are the first author on. It helps to be willing to have no skin in the publication game. It generates a lot of goodwill to make sure that you’re the last author and not the first author.

Winslett: One of the great joys of Postgres is that it allowed people to experiment with database components—join algorithms, index structures, optimization techniques—without having to build the rest of the system. What would be an equally open software system for today?

Stonebraker: A distributed version of Postgres.

Winslett: Who’s going to build that?

Stonebraker: I know! There is no open source multi-node database system I’m aware of that’s really good, and how one could be built remains to be seen. The big problem is that building it is a tremendous amount of work. It could come from Impala over time. It could come from one of the commercial vendors. The trouble with the commercial vendors is that the standard wisdom is to have a teaser piece of the system that’s open-source and then the rest of the system is proprietary. It’s exactly the distributed layer that tends to be proprietary. The vendors all want freemium pricing models and that makes a bunch of their system proprietary.

I don’t think such a system can come from academia, it’s just too hard. I think the days of building systems like Ingres and Postgres in universities are gone. The average Ph.D. student or postdoc has to publish a huge amount of stuff in order to get a job, and they’re not willing to code a lot and then write just one paper, which was the way Ingres and Postgres got written. We had grad students who coded a lot and published a little, and that’s no longer viable as a strategy.

Winslett: Could you do it with master’s students?

Stonebraker: Maybe. Let’s assume in round numbers that getting this distribution layer to be fast, reliable, and really work takes 10 man-years’ worth of work. Maybe it’s more, but the point is that it’s a lot of work. A master’s student is around for a maximum of two years, and you get maybe one year of productive work out of that person, assuming that they’re good (the average may be six months). So that means you need 20 of these people. That means it occurs over a decade. It isn’t like a cadre of 20 of them show up and say, “Here, manage me.”
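The staffing arithmetic in that answer works out as follows, using the round numbers stated above:

```python
# Round numbers from the answer: 10 person-years of work, and an average
# master's student contributes about six months of productive time.
effort_person_years = 10
avg_productive_years_per_student = 0.5
students_needed = effort_person_years / avg_productive_years_per_student
print(students_needed)  # 20.0 students; at a few at a time, that spans a decade
```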

Ingres and Postgres were both written with one full-time person and three or four or five grad students, no postdocs. Back then, you could get something built in a few years with that scope of a team. Today it’s just much harder to get stuff to work.

Winslett: The big data world has startup fever. We get it: often, it’s easier to raise money from the investment community than from traditional sources of academic funding. How can we maintain the transparency required to move a field forward scientifically if so many ideas are hidden inside the IP of startups?

Stonebraker: I know! I find it distressing that the success rate for National Science Foundation proposals is down to something like 7%. It’s getting incredibly hard to raise money in the traditional open-source, open-IP kinds of worlds. I think it’s a huge problem.

The way I look at it is that the number of faculty in any given discipline of computer science is up by an order of magnitude over what it was 20 years ago. The number of mouths to feed is up by at least that same amount, and funding has not kept pace at all. I think we’re starving.

The solution when you get disgusted with trying to raise money is that you leave the university and go to Google or another company. I wouldn’t be surprised if the brain-drain out of universities gets to be significant.

Winslett: Why is it a bad idea to bootstrap a startup using your own money?

Stonebraker: Look at Vertica, Illustra, any of the companies that I’ve started. In round numbers, they required $20 or $30 million to get a reliable, stable, sellable product out the door. If you’re writing an iPhone app, that’s a different situation. But writing enterprise software takes a lot of money, and getting to something that you can release as version 1 is usually $5 to $10 million. Unless you’re independently wealthy, that’s not possible with self-funding.

The self-funded companies I’ve seen that have succeeded have had a sugar daddy, a corporation that said, “I’ll pay for version 1 because I need it as an application, as long as you write something that I want.” If you have that, you’re basically having the customer fund the development.

If you’re actually going to fund it out of your own checking account, the trouble is that you and five of your friends agree to write code at nights and weekends, because you’ve got to have day jobs to keep the bill collectors away. It just takes forever if you’re writing code nights and weekends.

Another trouble with self-funding is that if your own money is invested, you make very, very cautious decisions relative to what VCs would make. In other words, they’re much better businessmen than you are and will make much better decisions about money than you will. Mortgaging your house to fund a startup is something I would never do … that’s a clear way to break up your marriage.

Winslett: You recommend that startup founders focus on great engineering, but when I consider the giant corporations in the data world, I get the impression that focusing on great marketing has been a more effective route to build market share.

Stonebraker: All true. You don’t have to look any further than Oracle and Ingres. One had great engineering, one had great marketing, and look who won.

The trouble is that your first objective must be to build something that’s reliable enough that your first five customers will buy it. If you don’t have really good engineering, chances are you’re not going to get to that milestone. If you just threw stuff together, chances are it’s going to have serious reliability problems, which are going to be very expensive to fix. Chances are that’s going to impact whether you can get revenue out of your first five customers.

So, worrying about superb engineering at the beginning is a really good idea. After that, making sure you have the world’s best VP of marketing is a terrific strategy.

Winslett: Your research group has often implemented full DBMSs, and Andy Pavlo has written about the challenges this raises. Do you still feel that this is the best way to advance the state of the art?

Stonebraker: Calling them full DBMSs is a big stretch. As has been pointed out by Martin Kersten, C-Store ran about 10 queries. It was not a complete implementation at all. We marketed it as a complete system, but it really didn’t have an optimizer. It hard-coded how to do the queries in our benchmarks. We cut a lot of corners on it.

H-Store was more complete than C-Store, but the academic version didn’t have a replication system. What Andy Pavlo did was heave most of the H-Store executor and replace it by the open source of the VoltDB executor. H-Store got better mostly because he swiped open-source commercial code. Since Postgres, we really haven’t produced what you would call a full-function well-functioning system. I think we’ve written pieces of such a system.

Winslett: Software is often free now, and not just phone apps. For example, when was the last time you bought a compiler? Will database software go the same route?

Stonebraker: I think that’s an interesting question, because right now, the model used by most of the recent DBMS startups is freemium. It isn’t really open source. It has a teaser piece that’s open source, but anyone who’s going to run it in production is going to get the non-free piece, and support only comes with the non-free piece. I think that the freemium model works fine, but it isn’t really a complete open-source system.

I think it will be interesting to see whether the big cloud vendors will have complete free open-source systems, which probably will only run on their hardware so they can get the rental income from their iron. Right now, I’m hard pressed to think of a complete open-source system that you’d actually put into production use.

Winslett: You biked across the country and climbed all forty-eight 4,000-plus foot mountains in New Hampshire. How has athletics affected your professional life?

Stonebraker: I’m wired to aggressively attempt to achieve what’s hard. That’s true in physical stuff, that’s true in professional stuff. That’s just the way I’m wired. There’s nothing purposeful there.

Winslett: You were one of the founders of CIDR. Has CIDR been a success or a failure?

Stonebraker: We started CIDR because SIGMOD was turning down practical papers. I think CIDR has proved to be a great venue for practical stuff over the years. The major conferences have attempted with some success to get more pragmatic papers, through their industrial tracks. But they still turn down my papers that are pragmatic, and that still pisses me off.

CIDR has caused the major conferences to change some, but in my opinion not enough. So I think CIDR continues to be a great outlet for practical papers that the major conferences won’t touch. As long as we stick to our knitting, CIDR will be very viable long-term. Every time it’s held, we have to close registration because it’s over-subscribed.

Winslett: Are minimum publishable units still a big issue in our community?

Stonebraker: That’s my favorite pet peeve. When I graduated with a Ph.D., I had zero publications. When I came up for tenure five years later, I had a handful, maybe six, and that was the norm. David DeWitt was the same way. That was typical back then. Now you have to have an order of magnitude more to get an assistant professor job or get tenure. That forces everybody to think in terms of Least Publishable Units (LPUs), which create a dizzying sea of junk that we all have to read. I think it’s awful. I don’t know how anybody keeps up with the deluge of publications. All of us tell our grad students to go read this or that paper and tell us what it says, because no one can physically read all that stuff.

My favorite strategy which might work is to get the top, say, 20 U.S. universities to say, “If you send us an application for an assistant professor position in computer science, list three publications. We’re not going to look at any more. Pick three. We don’t care if you have more. Pick three. When you come up for tenure, pick ten. If you publish more, we don’t want to look at them.” If you got the top universities to enforce that discipline, it might knock down the publication rates and start getting people to consolidate more LPUs into bigger and better papers.

Winslett: You might have the biggest family tree in the database field. Has that been an important factor in your success?

Stonebraker: I don’t think so. I think success is determined by having good ideas and having good graduate students and postdocs to realize them. I think the fact that I know a lot of people, some of whom are my students and some of whom are other people’s students, is not all that significant.

Winslett: When you pick a new problem to work on, how do you balance intellectual satisfaction, your industry buddies’ depth of desire for a solution, and the gut feeling that you might be able to turn it into a startup?

Stonebraker: I don’t think in those terms at all. I think mostly in terms of finding a problem that somebody has in the real world, and working on it. If you solve it in a way that’s commercializable, that’s downstream. You don’t get any brownie points at my university for doing startups.

I think the biggest mistakes I’ve made have been when we had a prototype and a student really wanted to do a startup, but I was very reluctant. Usually I was right. I’ve created startups that have failed and usually I didn’t think it would work at the beginning. I find it hard to resist when you have a Ph.D. student pleading with you, “Please do a startup.”

Winslett: Were you more productive at Berkeley or at MIT?

Stonebraker: I think I’ve been much more productive at MIT.

Winslett: And why is that?

Stonebraker: I have no idea. If you look at the data, I did 3 startups in 25 years at Berkeley, and I did 6 startups in 16 years at MIT.

Winslett: Which set do you think had more impact, though?

Stonebraker: Number one was probably Postgres and number two was probably Vertica, so it’s not obvious one way or the other.

Winslett: What successful research idea do you most wish had been your own?

Stonebraker: Parallel databases, like we talked about earlier.

Winslett: If you had the chance to do one thing over again, what would it be?

Stonebraker: I would have worked on parallel databases in the ’70s.

Winslett: A recurring theme! Did you ever have an idea that the research community rejected but that you still believe in fervently and may pursue again?

Stonebraker: If a paper gets rejected, as near as I can tell, everybody, including me, rewrites it until it gets accepted. I don’t know of any papers that actually went onto the cutting room floor. We all modify them until they get accepted.

Winslett: If you were a first-year data-oriented graduate student, what would you pick to work on?

Stonebraker: If you have a good idea on how to do cloud stuff … everybody’s going to move to the cloud. That’s going to be a huge amount of disruption. We’re going to run database systems where we share a million nodes.

If you have a good idea on how to make data integration work, it’s an unbelievably hard problem, and unbelievably important. If you have a good idea about database design, I would work on that. If you have a good idea on data cleaning, by all means, work on that. Find some problem in the wild that you can solve and solve it.

Winslett: Are there any hot topics in database research right now that you think are a waste of time?

Stonebraker: What we used to think of as database core competency is now a very minuscule portion of what appears in SIGMOD and VLDB. The field is basically fragmented beyond recognition. Very few papers on core database stuff appear these days. I think we’re doing a lot of non-cutting-edge research in all these fragmented different fields, and I wonder what the long-term impact of that is going to be.

I’m kind of a little bit worried, because the database guys are all publishing ML papers under the guise of scalability. That isn’t our community. There is an ML community that worries about ML. Database people don’t publish in pure ML conferences, mostly I suspect because we’re second-rate researchers there.

As near as I can tell, SIGMOD and VLDB are 300 or so researchers, and everything they and their grad students are doing is a huge spectrum of stuff. Deciding what’s workable and what’s not workable becomes very, very diffuse.

Winslett: Isn’t that inevitable in an expanding field like ours?

Stonebraker: Yeah, but I think if you look at the operating system guys, they’re starting to write database papers in their venues. When you get a lot of fragmentation, at some point it seems to me that we probably ought to reorganize computer science. CMU and Georgia Tech have schools of computer science that seem much better able to organize around this diffuse nature of things. MIT doesn’t. The universities that don’t have schools of computer science will be disadvantaged long-run, long-term. That’s a political hot potato at MIT and elsewhere.

Winslett: Which of your technical projects have given you the most personal satisfaction?

Stonebraker: Vertica and Postgres. I think Vertica was the most satisfying because Postgres we rewrote and then we rewrote it again. We started off implementing Postgres in LISP, which was the biggest disaster on the planet. That’s probably my biggest technical mistake ever. Postgres eventually got it more or less right. Vertica got it pretty much right the first time, which I thought was remarkable. Usually, you rewrite everything when you realize you screwed it up, and then you rewrite it again when you realize you still screwed it up. Vertica did pretty well the first time, which I thought was pretty remarkable.

Winslett: What was the most difficult part of your cross-country bike trip?

Stonebraker: The Turing Award lecture is set in North Dakota, and North Dakota was awful. Absolutely awful. Not so much because it’s flat and boring and you spend your day looking up ahead 10 miles, seeing the grain elevator in the next town, riding toward it for three-quarters of an hour, passing through the town in a couple of minutes, and then 10 miles up ahead is the next grain elevator. That’s really monotonous and boring, but what made it impossibly hard was that we were fighting impossible headwinds all the way across North Dakota. It is so demoralizing when you’re struggling to make seven miles an hour and you realize that it’s a 500-mile-wide state.

Winslett: What was the North Dakota of your career?

Stonebraker: I think it was by no means a slam dunk that I was going to get tenure at Berkeley. I think the department probably went out on a limb to make that happen. At the time, databases were this little backwater, and somebody had enough vision to promote me.

The stress associated with getting tenure I think is awful for everybody, universally, and the year that you’re up for tenure is horrible, no matter who you are. I personally think we shouldn’t subject assistant professors to that kind of stress. We should invent a better tenure system or gradual tenure system or something else. The stress level we subject assistant professors to is awful.

Winslett: If you were retired now, what would you be doing?

Stonebraker: That’s equivalent to the question, “What do I do when I’m no longer competitive as a researcher?”

Winslett: The definition of “retired” is when you’re no longer competitive as a researcher?

Stonebraker: Yeah. I’m going to work. I’m going to do what I’m doing until I’m not competitive doing it. I wake up in the morning and I like what I do. The only aspect of my job that I hate is editing student papers. Students by and large can’t write worth a darn. Like everybody else, I’m stuck fixing their papers. I hate that.

The real question you youngsters don’t really have to face, or you probably don’t think about, is what the state of my physical health is when I retire. If I’m impaired, then life’s in a whole different ballpark. If I’m full function, I will hike a lot more, I will bike a lot more. I always threaten to do woodworking. Dave DeWitt laughs at me because I have a woodworking shop that is essentially unused. I would spend more time with my wife. I would spend more time up here in New Hampshire.

I think the real answer is that I would probably become a venture capitalist of sorts if I’m not viable with my own ideas. I’m very good at helping people start companies, so I would probably do a whole bunch of that.

Winslett: I’m told that I should ask you to sing “The Yellow Rose of Texas.”

Stonebraker: Only after many, many beers or glasses of wine.

I don’t know where that question came from … the only place that question could have come from was from when the Illustra guys invited me to one of their sales reward meetings, and everybody had to get up and do karaoke. I don’t think I did “The Yellow Rose of Texas,” but maybe I did. That’s the only time I can remember … I wonder, where did that question come from?

Winslett: I would never give away my sources, even if I remembered who contributed that one, which I don’t.

Stonebraker: Anyway, I’m the world’s worst singer. I sing in a great monotone.

Winslett: I hear you play the banjo for a bluegrass band. What got you interested in the banjo and bluegrass music?

Stonebraker: When my first wife and I separated in 1975, I went out and bought a banjo within a couple months, then asked the guy who sold the banjo, “What kind of music do you play with this?” I have no idea why I chose the banjo. There’s nowhere in my history that anybody ever had a banjo, so I don’t have any idea why I decided to take it up.

Having kids got in the way of having time to play, but after the kids were adults, then I started playing again. I’m now in a band of sorts, called Shared Nothing. Not Shared Nothings (plural), it’s Shared Nothing (singular), exactly like the technical distributed database term shared nothing. We jam every couple of weeks but calling us a band is a bit of a stretch. Our goal is to start playing at assisted living centers because those places are always looking for something for their residents to do.

Winslett: Have you reached that level of expertise yet?

Stonebraker: I know a friend whose father is in an assisted living facility, and he pointed me to their entertainment director. I said, “Can we come play for your folks?” She said to send her a tape. So, we made a tape, but that was the last we heard from her. We’re not yet at the level of playing at assisted living centers.

Winslett: It’s something to aspire to.

Stonebraker: We’re good enough to play in front of my Ph.D. students, who are a captive audience.

Winslett: I hear that you wear a lot of red shirts.

Stonebraker: Yep.

Winslett: Why, and how many do you own?

Stonebraker: Approximately 15. I like red. I have a red boat. For a long time, I drove a red car. Red is my favorite color for whatever reason and I wear red shirts, although not today.

Winslett: To avoid the draft for the Vietnam War, you had to go straight to grad school after college. You have said that this forced you into a career path prematurely, without time to explore other options. In hindsight, what path do you think you would have taken if there hadn’t been a draft?

Stonebraker: When I graduated from college in 1965, my choices were go to graduate school, go to Vietnam, go to jail, or go to Canada. Those were the exact choices. If I went to graduate school, it was right at the time of the post-Sputnik science craze, so I would have a full-ride fellowship to sit out the war in graduate school. Why wouldn’t you do that? You had to sit in graduate school until you were 26 and the government didn’t want you for the war anymore. That forced me to get a Ph.D. that I don’t think I ever would have gotten without that kind of pressure. The threat of the draft was a powerful motivator.

You’re probably not old enough to remember the TV show called Route 66. You ever heard of that?

Winslett: Yes, but I haven’t watched it.

Stonebraker: It’s these two guys who drive around the country in a Corvette and have great experiences. When I graduated from college, I had no idea what I wanted to do, and so I would have done the Route 66 thing if I could. I have no idea where that would have led, but my life would have certainly been very different.

Winslett: Do young computer scientists even need a computer science degree today?

Stonebraker: In my opinion, yes. We talked about Google earlier. Google had a bunch of database projects that they assigned to people who were skilled at other stuff, and they screwed them up. They implemented short-term systems that weren’t viable long-term. There’s a lot of theory and pragmatics that we’ve accumulated over the years that is useful to know. I don’t know a better way to do it than by studying computer science.

By and large, if you look at people in surrounding disciplines who actually end up doing computer science, like students in most of the physical sciences, and you ask what contribution they have really made to computer science, the answer is it’s not very dramatic. For whatever reason, computer science advances tend to come from people who are trained as computer scientists.

Winslett: Do you have a philosophy for advising graduate students?

Stonebraker: The simple answer is that as a faculty member, your charge is to make your students successful. When you take someone on, it’s basically an agreement to make them successful, and if they drop out then you’re a failure. I try very, very hard to make my students successful. Usually that means feeding them good ideas when they don’t have any of their own, and pushing them hard, saying, “The VLDB deadline is in three weeks, and you can, in fact, get a paper in. Progress will have to be exponential in the distance to the deadline, but you can do it.”

Be a cheerleader and push your students to get stuff done at a rate much faster than they think they can. When they go off the rails, as they always do, pull them back onto the rails. This does take a lot of time. I meet with all my students once a week or more. My job is to be the cheerleader, idea generator, and encourager.

Winslett: If you magically had enough extra time to do one additional thing at work that you’re not doing now, what would it be?

Stonebraker: If I have a good idea I start working on it, even if I don’t have any extra time. I fit it in. So, I don’t have a good idea just sitting, waiting for some time to work on it. I don’t know what I’d do if I had extra time. Getting up in the morning and having nothing that I have to do drives me crazy. I stay very busy, and I don’t know what I would do with free time.

Winslett: If you could change one thing about yourself as a computer science researcher, what would it be?

Stonebraker: I’d learn how to code.

Winslett: That’s what you said the last time we did an interview, but obviously, since you’ve achieved high success without being able to code, it must not be necessary.

Stonebraker: I know, but it’s embarrassing. It’s something that takes a lot of time to learn, time that I don’t have. If I could magically create a lot of time, I’d learn how to code.

Winslett: Maybe while you were going across North Dakota in the headwinds, you could have been practicing on a little keyboard on the handlebars. That would have made it less painful.

Stonebraker: I don’t think you’ve ever been to North Dakota.

Winslett: I have, I have. And from your description, it sounds a lot like Illinois. The secret is to ride with the wind behind you. Maybe you could have gone to the far end of the state and ridden across it in the reverse direction. You still would have crossed North Dakota, but the wind would have been helping you.

Stonebraker: There you go.

Winslett: Thank you very much for talking with me today.

1. A video version of this conversation is also available at https://www.youtube.com/watch?v=vQIkkDaw6iE.

PART IV

THE BIG PICTURE

3

Leadership and Advocacy

Philip A. Bernstein

More than anyone else, Mike Stonebraker has set the research agenda for database system implementation architecture for the past 40 years: relational databases, distributed databases, object-relational databases, massively distributed federated databases, and specialized databases. In each of these cases, his was the groundbreaking research effort, arguing for a different system-level architecture type of database system. He proposed the architecture system type, justified its importance, evangelized the research agenda to create a new topic within the database community, and built a successful prototype that he later moved into the commercial world as a product, Ingres and Postgres being the most well known and influential examples. It is for these efforts that he richly deserves the ACM Turing Award.

I have been following Mike’s work since we first met at the 1975 SIGMOD Conference. Despite having crossed paths a few times per year ever since then, we’ve never collaborated on a research project. So, unlike most authors of chapters of this book, I don’t have personal experiences of project collaborations to recount. Instead, I will focus on his ideas, roughly in chronological order. I will first describe the systems, then mechanisms, and finally his advocacy for the database field.

Systems

The story starts with the Ingres project. At the project’s outset in 1973, the first generation of database system products were already well established. They used record-at-a-time programming interfaces, many of which followed the proposed CODASYL database standard, for which Charles Bachman received the 1973 Turing Award. Vendors of these products regarded the relational model as infeasible, or at least too difficult to implement—especially targeting a 16-bit minicomputer (a PDP-11/40), as the Ingres project did. Moreover, database management was a topic for business schools, not computer science departments. It was loosely associated with COBOL programming for business data processing, which received no respect in the academic computer science research community. In those days, IBM had a dominant share of the computer market and was heavily promoting its hierarchical database system, IMS. The only ray of hope that the relational model was worth implementing was that IBM Research had spun up the IBM System R project (see Chapter 35). In the world of 1973, Mike Stonebraker and Gene Wong were very brave in making Ingres the focus of their research.

As they say, the rest is history. The Ingres project (see Chapter 15) was a big research success, generating early papers on many of the main components of a database system: access methods, a view and integrity mechanism, a query language, and query optimization. Many of the students who worked on the system became leading database researchers and developers. Moreover, Ingres was unique among academic research projects in that its prototype was widely distributed and was used by applications (see Chapter 12). Ultimately, Ingres itself became a successful commercial product.

In 1984, Mike and his UC Berkeley colleague Larry Rowe started a follow-on project, Postgres (see Chapter 16), to correct many of the functional limitations in Ingres. By that time, it was apparent to the database research community that the first generation of relational databases was not ideal for engineering applications, such as computer-aided design and geographical information systems. To extend the reach of relational systems to these applications, Mike and Larry proposed several new features for Postgres, the most important of which was user-defined datatypes.

The notion of abstract data type was a well-understood concept at that time, having been pioneered by Barbara Liskov and Stephen Zilles in the mid-1970s [Liskov and Zilles 1974]. But it had not yet found its way into relational systems. Mike had his students prototype an abstract data type plugin for Ingres, which he then reapplied in Postgres [Stonebraker 1986b, Stonebraker 1986c]. This was among the earliest approaches to building an extensible database system, and it has turned out to be the dominant one, now commonly called a user-defined datatype. It led to another startup company, which developed the Illustra system based on Postgres. Illustra’s main feature was extensibility using abstract data types, which they called “data blades.” Illustra was acquired in 1996 by Informix, which was later acquired by IBM.
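The extensibility idea can be sketched in miniature. The following is a toy model only, with all names and functions invented for illustration (this is not the Postgres or Illustra API): the engine keeps a registry of user-defined types with their parsing and comparison behavior, and the executor looks operators up there instead of hard-coding a fixed set of built-in types.

```python
# Hypothetical sketch of user-defined datatypes as an extensibility
# hook. Nothing here is a real database API; the registry, the type
# name "point", and the helper functions are all invented.

TYPE_REGISTRY = {}

def register_type(name, parse, less_than):
    """Install a user-defined type: how to read its text form and
    how to compare two values of it."""
    TYPE_REGISTRY[name] = {"parse": parse, "lt": less_than}

# A toy 2-D point type, ordered by squared distance from the origin.
register_type(
    "point",
    parse=lambda s: tuple(float(x) for x in s.split(",")),
    less_than=lambda a, b: a[0]**2 + a[1]**2 < b[0]**2 + b[1]**2,
)

def evaluate_lt(type_name, left_text, right_text):
    """The executor evaluates '<' by consulting the registry rather
    than switching on a fixed list of built-in types."""
    t = TYPE_REGISTRY[type_name]
    return t["lt"](t["parse"](left_text), t["parse"](right_text))

result = evaluate_lt("point", "1,1", "3,4")  # (1,1) is closer to the origin
```

The point of the design is that new types, with their own operators and (in the real systems) index support, can be added without touching the engine's core.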

In the mid-1990s, Mike led the development of his next big system, Mariposa, which was a geo-distributed database system for heterogeneous data. It was layered on a network of independently managed database systems whose resources Mariposa could not control. Therefore, Mariposa introduced an economic model where each database bids to execute part of a query plan. The global optimizer selects bids that optimize the query and fit within the user’s budget. It was a novel and conceptually appealing approach. However, unlike Ingres and Postgres, Mariposa was not a major commercial success and did not create a new trend in database system design. It did lead to a startup company, Cohera Corporation, which ultimately used Mariposa’s heterogeneous query technology for catalog integration, which was a pressing problem for business-to-business e-commerce.

Starting in 2002, Mike embarked on a new wave of work on database systems specialized for particular usage patterns: stream databases for sensors and other real-time data sources, column stores for data warehousing, main memory databases for transaction processing, and array databases for scientific processing. With this line of work, he again set the agenda for both the research field and products. He argued that relational database products had become so large and difficult to modify that it was hopeless to expect them to respond to the challenges presented by each of these workloads. His tag line was that “one size does not fit all.” In each case, he showed that a new system that is customized for the new workload would outperform existing systems by orders of magnitude.
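The column-store half of that argument can be illustrated with a toy example (the table, column names, and data below are made up): storing each column contiguously means an analytic aggregate over one column never touches the bytes of the other columns, whereas a row store must walk every full row.

```python
# Illustrative row-store vs. column-store layouts for an analytic
# scan. This is a sketch of the storage idea only, not of any
# product; the table and its columns are invented.

# Row layout: each record is stored whole, so a scan of one column
# still drags every wide row through the scan.
rows = [{"id": i, "price": float(i), "note": "x" * 100} for i in range(1000)]
total_row = sum(r["price"] for r in rows)

# Column layout: the same table stored column-by-column; the
# aggregate reads only the 'price' column.
columns = {
    "id":    [r["id"] for r in rows],
    "price": [r["price"] for r in rows],
    "note":  [r["note"] for r in rows],
}
total_col = sum(columns["price"])

assert total_row == total_col  # same answer, very different I/O pattern
```

In a real column store the per-column layout also enables aggressive compression, which is a large part of the order-of-magnitude gains claimed for warehousing workloads.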

For each of these workloads, he followed the same playbook. First, there were research papers showing the potential of a system optimized for the workload. Next, he led a project to develop a prototype. Finally, he co-founded a startup company to commercialize the prototype. He founded startups in all of these areas. The Aurora research prototype led to the founding of StreamBase Systems for stream processing. The C-Store research prototype led to the founding of Vertica Systems (acquired by HP and now owned by Micro Focus) for column stores. The H-Store research prototype led to the founding of VoltDB (http://www.voltdb.com/) for main memory transaction processing. And the SciDB project, a database system for array processing, started as an open-source project with contributions by researchers at many different institutions, and led to the founding of Paradigm4 (http://www.paradigm4.com/).

Some vendors have picked up the gauntlet and shown how to modify existing products to handle these workloads. For example, Microsoft now offers a column store component of SQL Server (“Apollo”), and a main memory transactional database component of SQL Server (“Hekaton”). Perhaps the changes of workload and improvements in hardware made it inevitable that vendors would have developed these new features for existing products. But there is no doubt that Mike greatly accelerated the development of this functionality by pushing the field to move these challenges to the top of its priority list.

One of the most difficult problems in database management is integration of heterogeneous data. In 2006, he started the Morpheus project in collaboration with Joachim Hammer, which developed a repertoire of data transformations for use in data integration scenarios. This led to the Goby startup, which used such data transformations to integrate data sources in support of local search; the company was acquired by Telenav in 2011. Also in 2011, he started the Data Tamer project, which proposed an end-to-end solution to data curation. This project became the basis for another startup, Tamr, which is solving data integration problems for many large enterprises (http://tamr.com), especially unifying large numbers of data sources into a consistent dataset for data analytics.

Mechanisms

In addition to making the above architectural contributions as a visionary, Mike also invented important and innovative approaches to building the system components that enable those architectures and that are now used in all major database products. There are many examples from the Ingres system, such as the following.

•  Query modification (1975)—The implementation of views and integrity constraints can be reduced to ordinary query processing by expressing them as queries and substituting them into queries as part of query execution [Stonebraker 1975]. This technique, now known as view unfolding, is widely used in today’s products.

•  Use of B-trees in relational databases (1976–1978)—A highly influential technical report listed problems with using B-trees in relational database systems. This problem-list defined the research agenda for B-tree research for the following several years, which contributed to its use as today’s standard access method for relational database systems. Given the multi-year delay of journal publication in those days, by the time the paper appeared in CACM (Communications of the ACM) [Held and Stonebraker 1978], many of the problems listed there had already been solved.

•  Primary-copy replication control (1978)—To implement replicated data, one copy of the data is designated as the primary to which all updates are applied [Stonebraker 1978]. These updates are propagated to read-only replicas. This is now a standard replication mechanism in all major relational database system products.

•  Performance evaluation of locking methods (1977–1984)—Through a sequence of Ph.D. theses at UC Berkeley, he showed the importance of fine tuning a locking system to gain satisfactory transaction performance [Ries and Stonebraker 1977a, Ries and Stonebraker 1977b, Ries and Stonebraker 1979, Carey and Stonebraker 1984, Bhide and Stonebraker 1987, Bhide and Stonebraker 1988].

•  Implementing rules in a relational database (1982–1988, 1996)—He showed how to implement rules in a database system and advocated this approach over the then-popular rule-based AI systems. He initially built it into Ingres [Stonebraker et al. 1982a, Stonebraker et al. 1983c, Stonebraker et al. 1986] and incorporated a more powerful design into Postgres [Stonebraker et al. 1987c, Stonebraker et al. 1988a, Stonebraker et al. 1989, Stonebraker 1992a, Chandra et al. 1994, Potamianos and Stonebraker 1996]. He later extended this to larger-scale trigger systems, which are popular in today’s database products.

•  Stored procedures (1987)—In the Postgres system, he demonstrated the importance of incorporating application-oriented procedures inside the database system engine to avoid context-switching overhead. This became the key feature that made Sybase successful, led by his former student Bob Epstein, and is an essential feature in all database system products.
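The first mechanism in the list above, query modification, can be sketched as follows. This is a hedged toy model, not Ingres code: the view name, table, and predicates are invented. The idea is that a query against a view is reduced to an ordinary query against a base table by substituting in the view's defining predicate.

```python
# Toy sketch of query modification / view unfolding. All names
# (rich_employees, employees, the predicates) are illustrative.

VIEWS = {
    # view name -> (base table, the view's defining predicate)
    "rich_employees": ("employees", lambda row: row["salary"] > 100_000),
}

def run_query(name, pred, tables):
    """Evaluate a simple selection, unfolding views first by
    conjoining the view predicate onto the user's predicate."""
    while name in VIEWS:                      # the unfolding step
        base, view_pred = VIEWS[name]
        old_pred = pred
        pred = lambda row, p=old_pred, vp=view_pred: p(row) and vp(row)
        name = base                           # now an ordinary base-table query
    return [row for row in tables[name] if pred(row)]

tables = {"employees": [
    {"name": "a", "salary": 50_000,  "dept": "toy"},
    {"name": "b", "salary": 120_000, "dept": "toy"},
    {"name": "c", "salary": 130_000, "dept": "shoe"},
]}

# Corresponds to: SELECT * FROM rich_employees WHERE dept = 'toy'
result = run_query("rich_employees", lambda r: r["dept"] == "toy", tables)
```

The same substitution trick serves integrity constraints: a constraint expressed as a query is folded into each update, so enforcement reduces to ordinary query processing.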

Advocacy

In addition to the above architectural efforts that resulted in new database system product categories, Mike was an early advocate—often the leading advocate—for focusing attention on other critical system-level data management problems and approaches. These included the integration of legacy applications, ensuring distributed database systems will scale out, avoiding the inefficiency of too many layers of database middleware, enabling relational databases to be customized for vertical applications, and circumventing the inflexibility of one-size-fits-all database systems. For his entire career, he has been the database field’s technical conscience, an iconoclast who continually asks whether we are working on the right problems and using the best system architectures to address the most expensive data engineering problems of the day. Examples include shared-nothing distributed database architecture (today’s dominant approach) [Stonebraker 1985d, Stonebraker 1986d], adopting object-relational databases in preference over object-oriented databases [Stonebraker et al. 1990c, Stonebraker et al. 1990d], an incremental approach to migrating legacy applications [Brodie and Stonebraker 1995a], merging application servers and enterprise application integration (EAI) systems [Stonebraker 2002], and replacing one-size-fits-all database systems with more specialized database engines [Stonebraker and Çetintemel 2005, Stonebraker 2008b].

In 2002, Mike, David DeWitt, and Jim Gray lamented the difficulty of publishing novel ideas with a system focus in database conferences. To address this problem, they created the Conference on Innovative Data Systems Research (http://cidrdb.org/), with Mike as program committee chair of the first conference in 2003, co-PC chair of the second, and co-general chair of the third. As its website says, “CIDR especially values innovation, experience-based insight, and vision.” It’s my favorite database conference for two reasons: it has a very high density of interesting ideas that are major breaks from the past, and it is single-track, so I hear presentations in all areas of data management, not just topics I’m working on.

In 1989, Mike and Hans Schek led a workshop attended by many leaders of the database research community to review the state of database research and identify important new areas [Bernstein et al. 1998a]. Since then, Mike has been instrumental in ensuring such workshops run periodically, by convening an organizing committee and securing funding. There have been eight such workshops, originally convened every few years and, since 1998, every five years. Each workshop produces a report [Silberschatz et al. 1990, Bernstein et al. 1989, Bernstein et al. 1998b, Abiteboul et al. 2003, Gray et al. 2003, Abiteboul et al. 2005, Agrawal et al. 2008, 2009, Abadi et al. 2014, 2016]. The report is intended to help database researchers choose what to work on, help funding agencies understand why it’s important to fund database research, and help computer science departments determine in what areas they should hire database faculty. In aggregate, the reports also provide a historical record of the changing focus of the field.

Mike is always thinking about ways that the database research field can improve by looking at new problems and changing its processes. You can read about many of his latest concerns in his chapter (Chapter 11 in this book). But I’d bet money that by the time this book is published, he’ll be promoting other new issues that should demand our attention.

4

Perspectives: The 2014 ACM Turing Award

James Hamilton

Academic researchers work on problems they believe to be interesting and then publish their results. Particularly good researchers listen carefully to industry problems to find real problems, produce relevant work, and then publish the results. True giants of academia listen carefully to find real problems, produce relevant results, build real systems that actually work, and then publish the results.

The most common mistake in academic research is choosing the wrong problem to solve. Those that spend time with practitioners, and listen to the problems they face, produce much more relevant results. Even more important and much more time-consuming, those that build real systems have to understand the problems at an even deeper level and have to find solutions that are practical, can actually be implemented by mortals, and aren’t exponential in computational complexity. It’s much harder and significantly more time-consuming to build real implementations, but running systems are where solutions are really proven. My favorite researchers are both good listeners and great builders.

Michael Stonebraker takes it a step further. He builds entire companies on the basis of the systems research he’s done. We’ve all been through the experience of having a great idea where “nobody will listen.” Perhaps people you work with think they tried it back in 1966 and it was a failure. Perhaps some senior engineer has declared that it is simply the wrong approach. Perhaps people just haven’t taken the time to understand the solution well enough to fully understand the value. But, for whatever reason, there are times when the industry or the company you work for still doesn’t implement the idea, even though you know it to be a good one and good papers have been published with detailed proofs. It can be frustrating, and I’ve met people who end up a bit bitter from the process.

In the database world, there was a period when the only way any good research idea could ever see the light of day was to convince one of the big three database companies to implement it. This group has millions of lines of difficult-to-maintain code written ten years ago, and customers are paying them billions every year whether they implement your ideas or not. Unsurprisingly, this was a very incremental period when a lot of good ideas just didn’t see the light of day.

Stonebraker lovingly calls this group of three innovation gatekeepers “the Elephants.” Rather than wasting time ranting and railing at the Elephants (although he did some of that as well), he just built successful companies that showed the ideas worked well enough that they actually could sell successfully against the Elephants. Not only did he build companies but he also helped break the lock of the big three on database innovation and many database startups have subsequently flourished. We’re again going through a golden age of innovation in the database world. And, to a large extent, this new period of innovation has been made possible by work Stonebraker did. To be sure, other factors like the emergence of cloud computing also played a significant part in making change possible. But the approach of building real systems and then building real companies has helped unlock the entire industry.

Stonebraker’s ideas have been important for years, his lack of respect for the status quo has always been inspirational, and the database research and industry community have all changed greatly due to his influence. For this and a long history of innovation and contribution back to the database industry and research communities, Michael Stonebraker has won the 2014 ACM Turing Award, the most prestigious and important award in computer science. From the ACM announcement:

Michael Stonebraker is being recognized for fundamental contributions to the concepts and practices underlying modern database systems. Stonebraker is the inventor of many concepts that were crucial to making databases a reality and that are used in almost all modern database systems. His work on INGRES introduced the notion of query modification, used for integrity constraints and views. His later work on Postgres introduced the object-relational model, effectively merging databases with abstract data types while keeping the database separate from the programming language.
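Query modification is the one concrete mechanism named in the citation, and the core idea fits in a few lines. The sketch below is a deliberately simplified illustration, not the INGRES implementation; the table, view, and predicate strings are invented for this example. A view is stored as a predicate over a base relation, and a query against the view is rewritten by conjoining that predicate onto the query's own qualification.

```python
# Simplified sketch of query modification: a view is a stored predicate,
# and a query against the view is rewritten into a query against the base
# relation by AND-ing in the view's defining predicate. All names here are
# hypothetical, invented for illustration.

views = {
    # view name -> (base table, defining predicate)
    "rich_employee": ("employee", "salary > 50000"),
}

def modify_query(table: str, predicate: str) -> tuple[str, str]:
    """Rewrite a query over a view into a query over its base table."""
    if table in views:
        base, view_pred = views[table]
        combined = f"({view_pred}) and ({predicate})" if predicate else view_pred
        return base, combined
    # Queries over base tables pass through unchanged.
    return table, predicate

rewritten = modify_query("rich_employee", "dept = 'toy'")
# rewritten == ("employee", "(salary > 50000) and (dept = 'toy')")
```

The same substitution trick handles integrity constraints: the constraint predicate is appended to an update's qualification so that violating tuples are simply never touched.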

Stonebraker’s implementations of Ingres and Postgres demonstrated how to engineer database systems that support these concepts; he released these systems as open software, which allowed their widespread adoption and incorporation of their code bases into many modern database systems. Since the path-breaking work on INGRES and Postgres, Stonebraker has continued to be a thought leader in the database community and has had a number of other influential ideas including implementation techniques for column stores and scientific databases and for supporting on-line transaction processing and stream processing.

This chapter was previously published in James Hamilton’s Perspectives blog in June 2015. http://perspectives.mvdirona.com/2015/06/2014-acm-turing-award/. Accessed February 5, 2018.

5

Birth of an Industry; Path to the Turing Award

Jerry Held

The year was 1973. Having started my career at RCA Sarnoff Labs in New Jersey a few years earlier, I was working in the field of computer-aided design (CAD) for semiconductors. I was very fortunate to win a Sarnoff Fellowship to pursue a Ph.D. for two years at the university of my choice. My search eventually landed me at UC Berkeley with a plan to do research in CAD (or so I thought).

Although I didn’t realize it at the time, it also landed me in the delivery room for the birth of the relational database industry, where the ambition, insatiable curiosity, and almost uncanny luck1 of an audacious assistant professor named Michael Stonebraker changed computer science and would earn him computing’s highest honor, the A.M. Turing Award, 42 years later.

For me, this is the story of a dynamic professional collaboration and friendship that persists to this day and one of the more unique entrepreneurship stories in my nearly 50 years of helping build technology companies. It’s also the story of someone who has managed to couple sustained academic curiosity with unending entrepreneurship, which has led to a huge collection of academic work, a long line of successful students, and the creation of numerous companies.

Birth of an Industry (1970s)

In support of my CAD work at RCA, we decided to use a new database system based on work emanating from an industry/government consortium formed to promote standardized ways to access data: CODASYL (short for “Conference on Data Systems Languages”). BFGoodrich, the tire company, had done a huge project to create an early implementation of CODASYL.2 RCA was one of the first to purchase the software (which later became Cullinet). Part of my job was to dig into the guts of it and learn how these database systems really worked.

At Berkeley, I was looking for a Ph.D. advisor under whom I could do research in CAD. I was close to picking another professor when I was introduced to Mike, who was just starting his work in databases with a more senior professor, Eugene Wong. As I recall, it took only one meeting to realize that Mike and Gene were starting a very exciting project (dubbed Ingres3) and I wanted to be part of it.

I now had my thesis advisor. Mike’s ambition (and audacity) were blossoming, as he turned from his own, self-described obscure Ph.D. thesis and area of expertise (applied operations research) to more famous and tenure-making material. Gene was brilliant and also knew the ropes of getting things done at Berkeley—the beginning of what would be an uncanny run of luck for Mike.

There were four keys to the success of the Ingres project:

•  timing

•  team

•  competition

•  platform

Ingres—Timing

Gene had introduced Mike to Ted Codd’s seminal 1970 paper [Codd 1970] applying a relational model to databases. In Codd’s paper, Mike had found his Good Idea4 (more on this below). At that point, everyone was buzzing about Ted’s paper and its potential, particularly when compared to CODASYL and IBM’s IMS. Mike had of course read the CODASYL report but dismissed the specification as far too complicated. In the “Oral History of Michael Stonebraker,” [Grad 2007] Mike recalled:

… I couldn’t figure out why you would want to do anything that complicated and Ted’s work was simple, easy to understand. So it was pretty obvious that the naysayers were already saying nobody who didn’t have a Ph.D. could understand Ted Codd’s predicate calculus or his relational algebra. And even if you got past that hurdle, nobody could implement the stuff efficiently.

And even if you got past that hurdle, you could never teach this stuff to COBOL programmers. So it was pretty obvious that the right thing to do was to build a relational database system with an accessible query language. So Gene [Wong] and I set out to do that in 1972. And you didn’t have to be a rocket scientist to realize that this was an interesting research project.

In a headiness that today reminds me of the early days of the commercial Internet (at that point 20-plus years in the future), there was a lot going on. There wasn’t just our project, Ingres, and IBM’s System R project,5 the two that turned out to be the most prominent, but a dozen other projects going on at universities around the world (see Chapter 13). There were conferences and papers. There was debate. There was excitement. People were excited about the possibilities of databases with a relational model and accessible query languages, and all the things one might be able to do with these.

It was a brave new world. Little did we know that we were all taking the first steps to build a huge new database industry and that it would be thriving some 50 years later. Had we started a little earlier, we might have chosen to do research around the CODASYL model and reached a dead end. By starting later, we likely would have missed the chance to be pioneers. Timing (again) was luckily just right.

Ingres—Team

In my first conversations with Mike and Gene, we discussed my experiences with commercial database systems and agreed to build Ingres with underpinnings that might support real-world database applications.

Mike and Gene put together a small team of students. At the time, there were three Ph.D. students: Nancy MacDonald, Karel Youseffi, and me. There were two master’s students: Peter Kreps and Bill Zook. And four undergraduate students: Eric Allman, Richard Berman, Jim Ford, and Nick Whyte. Given my industry experience, I ended up being the chief programmer/project lead on this ragtag group’s race to build the first implementation of Codd’s vision.

It was a great group of people with diverse skills who really came together as a team. The good fortune of building this team should not be underestimated since, unlike today, most students then had no exposure to computer programming prior to entering university-level study.

Of course, had we any idea what we were undertaking, we would never have done it; it was far too big and challenging for a group of students.6 But that’s how many great things happen: you don’t think about possible failure, you just go for it. (Or, in Mike-speak: “Make it happen.”)

Because we were dealing with a completely new space, none of us really knew what relational databases could do. Fortunately, Mike’s academic curiosity kept constantly pushing us to think about the possibilities. In parallel with writing code and building the system, the team would have regular meetings in which Mike would lead a discussion, for example, on data integrity or data security. We explored many ideas, wrote concept papers, and laid the groundwork for future years of implementations while concentrating on getting the first version working.

Ingres—Competition

Although there were many other good research projects going on at universities around the world, they were almost entirely academically focused and therefore didn’t create much competition for a system that could support real applications.

Meanwhile, just an hour’s drive from Berkeley, IBM had created the System R project and assembled a stellar, well-funded team of researchers and computer scientists including the late, wildly talented Jim Gray: a terrific guy with whom both Mike and I later became lifelong friends and collaborators. This was real competition!

Although Ted Codd worked for IBM, he spent time with both teams and kept up with the rapid progress being made. The whole IBM System R team was very collaborative with us even though we were clearly competing.

So why is it that Ingres was built and widely distributed before the System R work saw the light of day? Again, it was our good luck that IBM suffered from “the innovator’s dilemma,” as described in the famous book of the same name by Clayton Christensen [Christensen 1997].

At the time, IBM dominated every aspect of computing—computer systems, software, operating systems and, yes, databases. IBM had IMS, which was THE database system in the industry, period. If you didn’t need a database system, they had ISAM and VSAM, file systems to store everything else. IBM clearly believed that it was not in its interest for this new “relational” database model to take hold and possibly disrupt its dominant position.

And so, between the many university projects and the very well-funded IBM project, Ingres wound up as the first widely available relational database capable of supporting real-world applications.

Ingres—Platform

Maybe our greatest stroke of luck was the choice of platform for building Ingres. Mike and Gene were able to obtain one of the first copies of UNIX, which at the time was a completely unknown operating system. Little did we know that UNIX would become wildly popular; Ingres followed UNIX around the world, and lots of people started using our work. We acquired lots of users, which provided great feedback and accelerated our progress.

Mike made two other key decisions that would change Ingres’ fortunes. First, he put the academic code into the public domain using a license that broadly enabled others to use the code and build on it (making him an accidental pioneer in the “open source” movement).7 Second, in 1980, he, Gene, and Larry Rowe created a company, Relational Technology, Inc. (RTI), to commercialize Ingres, believing that the best way to ensure that Ingres continued to prosper and work at scale for users was to back it with commercial development.

The Ingres model—open-source academic work leading to a commercial entity—became the formula for the rest of Mike’s career (Chapter 7).

Adolescence with Competition (1980s and 1990s)

Meanwhile, I had left Berkeley in 1975 with my newly minted Ph.D. I went on to help start Tandem Computers, taking a lot of what I learned at Berkeley and building out a commercial database with some unique new properties (leading to NonStop SQL8). His commercial curiosity piqued, Mike came and did some consulting for us—his first exposure to seeing a startup in action. We had built a great team at Tandem, including people like Jim Gray, Franco Putzolu, Stu Schuster, and Karel Youssefi, and were exploring other aspects of the database world like transactions, fault tolerance, and shared-nothing architectures. I think that this experience helped Mike get a taste of entrepreneurship, for which he would later become well known.

Competing with Oracle

As a part of the Ingres project, we had created the QUEL language [Held et al. 1975] to allow users to query and update the database. The IBM team had defined a different language originally called SEQUEL but later shortened to SQL. A few years before Mike had co-founded RTI to commercialize Ingres, Larry Ellison had started Oracle Corporation with the idea of implementing IBM’s SQL query language before IBM brought a product to market.
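Since QUEL and SQL both implemented Codd's relational model, the rivalry was largely one of surface syntax. As a hedged illustration (the table and data here are invented for this sketch, not taken from the text), the same selection written in QUEL's tuple-variable style and in SQL, with the SQL version run against an in-memory SQLite database:

```python
import sqlite3

# QUEL bound an explicit tuple variable to a relation:
#
#     range of e is employee
#     retrieve (e.name) where e.salary > 50000
#
# The SQL equivalent, runnable here against SQLite. The employee table
# and its contents are hypothetical examples.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (name TEXT, salary REAL)")
conn.executemany(
    "INSERT INTO employee VALUES (?, ?)",
    [("Alice", 60000.0), ("Bob", 45000.0)],
)
rows = conn.execute(
    "SELECT name FROM employee WHERE salary > 50000"
).fetchall()
# rows == [("Alice",)]
```

Either way the user states *what* tuples are wanted, not how to navigate to them, which is exactly the accessibility argument Mike made against CODASYL.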

With IBM’s great technology stuck behind the wall, RTI (with Ingres’ QUEL language) and Oracle (with SQL) were the leading contenders to provide a commercial relational database system. In the early days of competition, many observers would say that Ingres had the best technology; however, a major lesson for Mike to learn was that “the best technology doesn’t always win.” Oracle ended up with a significant advantage by adopting IBM’s SQL query language; it might not have been as good as QUEL,9 but IBM’s reputation and ability to create both de facto and industry standards won the day. Also, Mike had (and probably still has) a heavy focus on technology, while Ellison was able to win over many customers with strong sales and marketing.

Many more details of the commercial Ingres effort are well documented throughout this book.10

Competing with Oracle (Again)

In the meantime, back at Berkeley, Mike’s academic curiosity continued post-Ingres. He started the Postgres project to explore complex and user-defined data types—a decade-long project that Stonebraker student/collaborator Joe Hellerstein calls “Stonebraker’s most ambitious project—his grand effort to build a one-size-fits-all database system.”11

Following the five-step model he had used for Ingres, Mike in 1992 formed Illustra Information Technologies, Inc., a company to commercialize Postgres, becoming its Chief Technology Officer. After a few years, Illustra was acquired by Informix, whose major competitor was Oracle.

In 1993, I had left Tandem and gone to what Mike would call “the Dark Side”: Oracle, where I ran the database group. With the Illustra acquisition, Mike had become CTO of Informix and we became competitors. During this time, the press was full of articles covering our often-heated debates on the best way to extend the relational database model. Although the Postgres/Illustra Object-Relational technology may have been better than the Oracle 8 Object-Relational technology, Mike again learned that the best technology doesn’t always win.

Through it all, Mike and I remained friends—which queued up the next phase of our relationship: commercial collaborators.

Maturity with Variety (2000s and 2010s)

The 1980s and 1990s had been a period of building general-purpose database systems (Ingres and Postgres) and competing across a broad front with Oracle and other general-purpose systems. As the new millennium dawned, however, Mike entered into a long period of focus on special-purpose database systems that would excel at performing specific tasks. This is his “one size does not fit all” period [Stonebraker and Çetintemel 2005, Stonebraker et al. 2007a].

Streaming data, complex scientific data, fast analytics, and high-speed transactional processing were just a few of the areas crying out for Mike’s five-step approach to system ideation, development, and technology transfer—and his continued ambition to build companies.

In 2000, Mike retired from Berkeley and headed east to a new base of operations at MIT CSAIL (yet another new frontier, as MIT had no database research group to speak of at the time) and a new home for his family in the Greater Boston area (close to his beloved White Mountains in New Hampshire).

As described elsewhere in this book,12 Mike proceeded to help create a world-class database research group at MIT while leading research into each of these specialty areas. He also drew on the formidable database expertise at other area universities, such as Brandeis (Mitch Cherniack) and Brown University (Stan Zdonik).

Around the same time that Mike was moving east, I left Oracle to concentrate on what I (finally) realized I loved to do most: help companies get off the ground. I spent a year at Kleiner Perkins and got a world-class education in the venture capital world. After helping to start a couple of Kleiner-backed companies, I set out on my own as a board member and mentor to a long list of startups (and large public companies). This eventually led me back to Mike.

Vertica

Both Mike and I had separately done some consulting with a startup that was attempting to build a special-purpose database system for high-speed analytics. The consulting engagements were both very short, as neither of us was enamored with the company’s approach. Mike, however, was intrigued with the general problem and started the C-Store project at MIT [Stonebraker et al. 2005a] to investigate the possible use of column stores for high-speed analytic databases (versus the traditional row store approach). As usual, Mike turned the academic project into a company (Vertica Systems, Inc.) with the help of Andy Palmer as the founding CEO.13 In 2006, I joined them as chairman of Vertica. Mike and Andy were able to build a great team and the Vertica product was able to produce a 100x performance advantage over general-purpose database systems and demonstrate that “one size doesn’t fit all.” Vertica had a reasonably good outcome as it was acquired by HP in 2011, becoming the company’s “Big Data” analytics platform.14
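The intuition behind the column-store bet can be shown with a toy sketch (this is not Vertica's engine; the data and field names are invented): a row store must touch every field of every row it scans, while a column store reads only the columns a query actually references.

```python
# Toy sketch of row-store vs. column-store layouts. An analytic query
# that aggregates one field out of three touches far less data when the
# table is decomposed into per-column arrays.

records = [{"id": i, "price": float(i % 100), "region": "US"}
           for i in range(10_000)]

# Row-store style: iterate whole rows to aggregate a single field.
row_total = sum(rec["price"] for rec in records)

# Column-store style: the same data decomposed column-wise.
columns = {
    "id": [rec["id"] for rec in records],
    "price": [rec["price"] for rec in records],
    "region": [rec["region"] for rec in records],
}
# The aggregate now scans one contiguous array, ignoring "id" and "region".
col_total = sum(columns["price"])

assert row_total == col_total  # same answer, less data touched per query
```

On disk the effect is magnified: a real column store also compresses each homogeneous column aggressively, which is a large part of where C-Store's reported performance advantage came from.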

VoltDB

While we were building Vertica, Mike started the H-Store research project at MIT with the idea of building a high-performance row store specialized for transaction processing (see Chapter 19 “H-Store/VoltDB”). To get a commercial version off the ground, we incubated a team inside of Vertica. Although there were clear benefits to a specialized system for transaction processing, the market for a very high-performance transaction system was much more limited than on the analytics side and it was significantly more difficult for a startup to pursue these typically mission-critical applications.

VoltDB, Inc. was spun out of Vertica in 2009 and has continued with moderate success. Interestingly, over the past three years, I have been chairman of MemSQL, a company that started with an in-memory row store (à la VoltDB), added a column store (à la Vertica), and has had significant success going after the emerging real-time database market.

Tamr

Mike realized that the effectiveness of all of the database systems that he had worked on over the decades depended on the quality of the data going into them. He started the Data Tamer project at MIT [Stonebraker et al. 2013b]—with collaborators at QCRI (Qatar Computing Research Institute), Brandeis University, and Brown University—to investigate how a combination of Machine Intelligence and human data experts could efficiently unify the disparate data sources that feed ever-expanding databases. When it came time to start commercialization of the technology, Mike turned again to Andy Palmer as CEO and me as chairman. Tamr, Inc., was founded in 2013.

Tamr is similar but also very different from previous Stonebraker startups. The Tamr system15 is similar in that it followed Mike’s basic formula for turning academic research into a commercial operation, but modified for a product that prepares data for a DBMS instead of being the DBMS. It differs from all but the original Ingres project in that it explored a completely new problem (enterprise data unification) instead of being an incremental investigation of different aspects of database technology. We’re still in the middle of the Tamr journey; however, many large enterprises are realizing great results and the company is growing nicely.

The Bottom Line

Beyond his ambition to help keep the relational database industry vibrant, “honest,” and focused on customers’ evolving data management problems, Mike (in my view) was long driven by the ambition to win the Turing Award—the pre-eminent award in the computer science world. In exploring so many different research areas in an academic setting and then proving these ideas in a commercial setting, he steadily built his résumé in a most unique manner to achieve this goal.

In all of his commercial ventures, Mike has taken the role of Chief Technology Officer and led the high-level product architecture. But just as importantly, in that CTO role, he has had the opportunity to have many deep interactions with customers and gain an understanding of what the products can and cannot do. This insight has led not only to improvements in the current company’s products but also to ideas for new academic research, completing a virtuous circle leading to his next entrepreneurial efforts.

Mike continues to imagine, inquire, investigate, innovate, inspire, and (yes) irritate today, even after having received the 2014 Turing Award and having just entered his 74th year (at the time I write this). Although there are others who have been more successful in building large, extremely successful database companies (especially Mike’s “buddy” Larry Ellison), and there may be someone who has written more academic database papers (but I’m not sure who), there is certainly no one who comes close to Mike as a combined academic and entrepreneur in what has been one of the most important parts of the software world: database systems.

So, Mike, what’s next?

1. As Mike himself will tell you.

2. CODASYL was supposed to be the industry standard to compete against IMS, IBM’s market-dominant, hierarchical database system.

3. For its originally intended application, an INteractive Graphics REtrieval System.

4. For more information on how Mike finds his ideas, see Chapter 10.

5. For more on the IBM System R project, read Chapter 35 by IBM Fellow and “father of DB2” Don Haderle.

6. As Mike affirmed in an interview with Marianne Winslett [M. Winslett. June 2003. Michael Stonebraker speaks out. ACM SIGMOD Record, 32(2)].

7. For more on this topic, see Chapter 12.

8. http://en.wikipedia.org/wiki/NonStop_SQL (Last accessed January 4, 2018).

9. For an interesting view on this, read Chapter 35 by Don Haderle, who worked on the IBM System R research project.

10. In particular, see Chapter 15.

11. For details about Postgres—including its many enduring contributions to the modern RDBMS industry and its code lines—read Joe Hellerstein’s detailed Chapter 16.

12. See Chapter 1, Research Contributions by System; and Part 7.1, Contributions from Building Systems, describing more than a half-dozen special-purpose systems in current or emerging use.

13. See Chapter 18, for more on the company’s successful commercial implementation of the technology.

14. For an inside look at a Stonebraker startup, read Andy Palmer’s Chapter 8.

15. For more on the Tamr system platform, read Chapter 21.

6

A Perspective of Mike from a 50-Year Vantage Point

David J. DeWitt

Fall 1970—University of Michigan

When I arrived in Ann Arbor in September 1970 to start a graduate degree in computer engineering, the city was just months removed from a series of major protests against the Vietnam War that would continue until the war was finally ended. Having spent the previous year at various protest marches in Chicago and Washington, D.C., I felt right at home. I was actually fortunate to be in graduate school at all because military deferments for graduate school had been terminated by then and my lottery number was sufficiently low to pretty much guarantee that I was headed for the rice paddies of Vietnam. Had I not failed my Army physical, I could easily have been headed to Vietnam instead of Michigan.

In the late 1960s, very few universities offered undergraduate degrees in computer science (CS). I was actually a chemistry major but had taken three or four CS seminars as an undergraduate. I was definitely not well prepared to undertake a rigorous graduate program. One of the classes I took my first semester as a graduate student was an introductory computer architecture course. So, there I was, a scared, ill-prepared first-year graduate student with this towering guy as our TA (teaching assistant). That TA turned out to be a guy named Mike Stonebraker, who would have a huge impact on my professional career over the next 48 years. It was eye-opening to discover that my TA knew less about the subject than I did as an incoming graduate student, despite his height. This gave me hope that I could successfully make the transition to graduate school.

The following Spring, Mike finished his thesis under Arch Naylor and headed off to Berkeley. Among the graduate students, there were rumors of epic battles between Mike and Arch over his thesis. I never learned whether those battles were over content or style (Arch was a real stickler when it came to writing). However, Arch never took on another graduate student between then and when he retired in 1994. Looking back, it is still hard to fathom how a graduate student whose thesis explored the mathematics of random Markov chains and who had no experience building software artifacts and limited programming ability (actually none, I believe) was going to end up helping to launch an entire new area of CS research as well as a $50B/year industry.

From 1971–1976 while I was working on my Ph.D., Mike and I lost contact with one another, but it turned out that both of us had discovered database systems during the intervening period. I took an incredibly boring class that covered the IMS and CODASYL data models and their low-level procedural manipulation languages. As far as I can recall the class never mentioned the relational data model. Almost simultaneously, in 1973 Mike and Gene Wong started the Ingres project—clearly a huge risk since (a) Mike was an untenured assistant professor in an EE (electrical engineering) department, (b) the techniques for building a relational database system were totally unknown, and (c) he really did not know anything about building large software artifacts. Furthermore, the target platform for the first version of Ingres was a PDP 11/45 whose 16-bit address space required building the system as four separate processes. The highly successful outcome of Mike’s first “big” bet clearly laid the groundwork for the other big bets he would make over the course of the next 50 years.

In the Spring of 1976, I finished my Ph.D. in computer architecture and took a job as an assistant professor at Wisconsin. A month before classes were to start, I was assigned to design and teach a new class on database systems and strongly encouraged to switch the focus of my research program from computer architecture to database systems even though I knew nothing about this new area of relational database systems. Fortunately, I knew the “tall guy” who, in the intervening six years, had established Berkeley as the academic leader of this new field of research.

Fall 1976—Wisconsin

While Mike and I had not been good friends at Michigan, for reasons I still do not understand Mike decided to become my mentor once we reconnected. He provided me with an early copy of Ingres to use in the classroom and invited me to attend the Second Berkeley Workshop on Distributed Data Management and Computer Networks in May 1977 (Chapter 12), where he introduced me to what was then a very small community of researchers in the database systems field. It was my first exposure to the database research community. It was also the venue where Mike published his first paper on a distributed version of Ingres [Stonebraker and Neuhold 1977]. Given that Mike and Gene had just gotten Ingres to work, attempting to build a distributed version of Ingres in that timeframe was a huge challenge (there were no “off-the-shelf” networking stacks in Unix until Bill Joy released BSD Unix for the VAX in the early 1980s). While this project did not turn out to have the same commercial impact that Ingres did (nor did R*, the IBM competitor), the numerous technical challenges in access control, query processing, networking, and concurrency control provided a rich set of challenges for the academic research community to tackle. The impact of these two projects, and that of System R, on our field cannot be overstated. By demonstrating the viability of a DBMS based on the relational data model and through technical leadership at SIGMOD and VLDB conferences, the Ingres and System R projects provided a framework for the nascent database research community to explore, a voyage that continues to this day.

With the tenure clock ticking and bored with the line of research from my Ph.D. thesis (exploitation of functional-level parallelism at the processor level), I decided to try to build a parallel database system and launched the DIRECT project in early 1978 [DeWitt 1979a, 1979b]. While others had started to explore this idea of processing relational queries in parallel—notably the RAP project at Toronto, the RARES project at Utah, the CASSM project at Florida and the DBC project at Ohio State University—I had one critical advantage: I knew the “tall” guy and had, by that time, studied the Ingres source code extensively.

Since obtaining a copy of Ingres required signing a site license and paying a small fee ($50) to cover the cost of producing a tape, some will dispute whether or not Ingres was the first open source piece of software (Chapter 12). My recollection is that the copy of Ingres that I received in the Fall of 1976 included the source code for Ingres—predating the first release of Berkeley Unix by at least two years [Berkeley Software Distribution n.d., BSD licenses n.d.]. Apparently lost to history is what copyright, if any, the early Ingres releases had. We do, however, still have access to the source code for Ingres Version 7.1 (dated February 5, 1981), which was distributed as part of the BSD 4.2 Unix release.1 Only two files have copyright notices in them: parser.h and tutorial.nr. The copyright notice in these two files is reproduced below:

/*
** COPYRIGHT **
**
** The Regents of the University of California
**
** 1977
**
** This program material is the property of the
** Regents of the University of California and
** may not be reproduced or disclosed without
** the prior written permission of the owner.
*/

While this copyright is more restrictive than the BSD copyright used by Ingres in later releases, one could argue that, since none of the other .h or .c files contained any copyright notice, the early versions of Ingres were truly open source.

Whether or not Ingres should be considered as the very first example of open-source software, it was truly the first database system to be released in source-code form and hence provided the nascent database community the first example of a working DBMS that academic researchers could study and modify. The source code for Ingres played a critical role in the implementation of DIRECT, our first effort to build a parallel database system.

As mentioned above, the initial versions of Ingres were implemented as four processes, as shown in Figure 6.1.

Process 1 served as a terminal monitor. Process 2 contained a parser, catalog support, concurrency control, and code to implement query modification to enforce integrity constraints. Once a query had been parsed it was passed to Process 3 for execution. Utility commands to create/drop tables and indices were executed by Process 4.
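The division of labor can be sketched as follows. This is an illustrative Python model, not Ingres code: the tiny command grammar and all function names are invented for the example. In the real system each stage was a separate Unix process, a structure forced by the PDP 11/45's 16-bit address space.

```python
# Illustrative model of the four-process Ingres pipeline described above.
# Ordinary function calls stand in for the inter-process hand-offs of the
# real system; the command grammar here is invented.

def terminal_monitor(user_input):
    """Process 1: collect a complete command from the terminal."""
    return user_input.strip()

def parse_and_rewrite(command, catalog):
    """Process 2: parse the command, consult the catalog, and route it."""
    verb, _, rest = command.partition(" ")
    if verb == "retrieve" and rest in catalog:
        return ("QUERY", rest)        # handed off to process 3
    if verb == "create":
        return ("UTILITY", rest)      # handed off to process 4
    raise ValueError("unknown command: " + command)

def execute_query(plan, tables):
    """Process 3: execute a parsed query plan."""
    kind, table = plan
    assert kind == "QUERY"
    return tables[table]

def run_utility(plan, tables, catalog):
    """Process 4: utility commands such as creating a table."""
    kind, table = plan
    assert kind == "UTILITY"
    tables[table] = []
    catalog.add(table)

# A short session: create a table, then retrieve from it.
catalog, tables = set(), {}
run_utility(parse_and_rewrite(terminal_monitor("create emp\n"), catalog),
            tables, catalog)
result = execute_query(parse_and_rewrite(terminal_monitor("retrieve emp\n"),
                                         catalog), tables)
```

The point of the sketch is only the routing: the parser never executes anything itself, and queries and utility commands take different paths, exactly as in Figure 6.1.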

Since we had access to the Ingres source code, our strategy for implementing DIRECT was to reuse as much of the Ingres code as possible. Other than process 3, we were able to use the other Ingres processes with little or no modification. Our version of process 3 took the query, translated it into a form that DIRECT could execute, and then sent it to DIRECT’s backend controller for parallel execution on a collection of four PDP 11/23s.

While access to Ingres source code was essential to the project, Mike went much further and made the entire Ingres team available to help me with questions about how Ingres worked, including Bob Epstein (who went on to start Britton Lee and then Sybase) and Eric Allman (of Unix BSD and Sendmail fame). While DIRECT was largely a failure (it ran QUEL queries in parallel but not very effectively; [Bitton et al. 1983]), without Mike’s generous support it would not have succeeded at all. Not only would I have not gotten tenure, but also the lessons learned from doing DIRECT were critical to being able to be more successful when I started the Gamma project in early 1984 [DeWitt et al. 1986].

Figure 6.1  Ingres process structure.

Fall 1983—Berkeley

Having been tenured in the Spring of 1983, it was time for me to take a sabbatical. Mike invited me to come to Berkeley for the Fall 1983 semester, found us a house to rent in Berkeley Hills, and even lent us some speakers for the stereo system that came with the house. While I spent a lot of the semester continuing the Wisconsin benchmarking effort that had started a year earlier, Mike organized a weekly seminar to study novel database algorithms that could take advantage of large amounts of main memory. The resulting SIGMOD 1984 publication [DeWitt et al. 1984] made a number of seminal contributions, including the hybrid hash join and group commit algorithms.
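One of those contributions, the hybrid hash join, can be sketched in a few lines. This is a simplified illustration of the idea rather than the paper's algorithm: partition both inputs by hash, but keep partition 0's hash table in memory so its tuples are joined on the first pass instead of being spilled. Here "spilling" is just an in-memory list; a real system writes spilled partitions to disk and joins them in later passes.

```python
# Simplified sketch of the hybrid hash join idea (not the 1984 paper's
# algorithm): partition 0 of the build input stays memory-resident, so
# probe tuples that hash there are joined immediately.

NUM_PARTITIONS = 4

def hybrid_hash_join(build, probe, build_key, probe_key):
    # Pass 1: partition the build input; partition 0 stays in memory.
    in_memory = {}
    spilled_build = [[] for _ in range(NUM_PARTITIONS - 1)]
    for row in build:
        p = hash(build_key(row)) % NUM_PARTITIONS
        if p == 0:
            in_memory.setdefault(build_key(row), []).append(row)
        else:
            spilled_build[p - 1].append(row)

    results = []
    spilled_probe = [[] for _ in range(NUM_PARTITIONS - 1)]
    for row in probe:
        p = hash(probe_key(row)) % NUM_PARTITIONS
        if p == 0:  # joined immediately; never spilled
            for match in in_memory.get(probe_key(row), []):
                results.append((match, row))
        else:
            spilled_probe[p - 1].append(row)

    # Pass 2: join each spilled build/probe partition pair in memory.
    for b_part, p_part in zip(spilled_build, spilled_probe):
        table = {}
        for row in b_part:
            table.setdefault(build_key(row), []).append(row)
        for row in p_part:
            for match in table.get(probe_key(row), []):
                results.append((match, row))
    return results

# Join departments to employees on department number.
depts = [(1, "Sales"), (2, "Engineering")]
emps = [("alice", 1), ("bob", 2), ("carol", 1)]
joined = hybrid_hash_join(depts, emps, lambda d: d[0], lambda e: e[1])
```

The payoff over a plain partitioned (Grace-style) hash join is that the memory that would otherwise sit idle during partitioning is used to finish part of the join in the first pass.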

This was also the time that the first commercial version of Ingres (RTI Ingres) appeared and Mike graciously let us benchmark it on one of RTI’s computers running VMS. While he had some minor quibbles about the results, he was more than placated by the fact that the commercial version of Ingres was significantly faster than its commercial rival Oracle for essentially all the queries in the benchmark—results that launched a sequence of benchmark wars between the two rival products.

1988–1995—No Object Oriented DBMS Detour for Mike

Frustrated by the impedance mismatch between the limited type system of the relational model and the languages used to develop database applications and inspired by the 1984 SIGMOD paper titled “Making Smalltalk a Database System” by Copeland and Maier [1984], the academic database and startup communities took a detour in an attempt to develop a new generation of database systems based on an object-oriented data model. Mike, unconvinced that a switch in data models was either wise or warranted, crushed the rebellion with a combination of Postgres, the quad chart shown in Figure 6.2, a book to explain the quad chart (Object-Relational DBMSs: Tracking the Next Great Wave, with Paul Brown) [Stonebraker and Moore 1996], and a fantastic bumper sticker: “The database for cyber space.”

Following the same approach that he had used so successfully with Ingres (and would continue to use numerous times in the future), Mike used the Postgres code to start Illustra Information Technologies, Inc., which he sold to Informix in early 1996. The thing I remember most about the sale was that Jim Gray, having discovered the sale price in some SEC EDGAR report, called me at about 2 A.M. to tell me. Seems the quad chart was worth about $200M.

Figure 6.2  One of Mike’s more famous quad charts.

One interesting thing to reflect on is that the Illustra acquisition began the long slow decline of Informix, as integrating the two code bases proved to be technically much more difficult than originally assumed. The other interesting outcome is that Postgres to this day remains one of the most popular open-source database systems. In recent years, while Mike has pushed the “one size fits none” mantra to justify a series of application-specific database systems (Vertica, Paradigm4, and VoltDB), I always remind myself that Postgres, by far his most successful DBMS effort, most definitely falls into the “one-size-fits-all” camp.

2000—Project Sequoia

In the mid-1990s, NASA launched an effort to study the earth called “Mission to Planet Earth” (MTPE) using a series of remote sensing satellites designed to obtain “data on key parameters of global climate change.” As part of this effort, NASA issued an RFP for alternative designs for the data storage and processing components that would be needed to store and analyze the terabytes of data that the satellites were expected to generate over their lifetimes. This RFP inspired Mike, along with Jim Gray, Jeff Dozier, and Jim Frew, to start a project called Sequoia 2000 to design and implement a database-centric approach (largely based on Postgres). While NASA eventually rejected their approach (instead they picked a design based on CORBA—remember CORBA?), the effort inspired Gray to work with Alex Szalay to use SQL Server for the Sloan Digital Sky Survey, which proved to be a huge success in the space science community.

The Sequoia 2000 project also inspired Jeff Naughton and me to start the Paradise project at Wisconsin. While we were convinced of the merits of a database-centric approach, we did not believe that a single Postgres instance would be sufficient and set out to build a parallel database system from scratch targeted at the technical challenges of the MTPE data storage and processing pipeline. While Paradise reused many of the parallel database techniques that we had developed as part of the Gamma project, it had several unique features, including a full suite of parallel algorithms for spatial operations (e.g., spatial selections and joins) and integrated support for storing and processing satellite imagery on tertiary storage.

2003—CIDR Conference Launch

In June 2002, having had all our papers rejected by Mike Franklin, the program committee chair of the 2002 SIGMOD conference, Mike, Jim Gray, and I decided to start a new database system conference. We felt strongly that SIGMOD had lost its way when it came to evaluating and accepting systems-oriented database research papers (a situation we find ourselves in again in 2018; see Chapter 11). We felt the only solution that was going to work was to start a new conference with an objective of accepting only papers that had the potential to advance the state of the art. We specifically did not want people to submit polished pieces of research, preferring instead half-baked ideas. To this day, CIDR, the Conference on Innovative Data Systems Research, continues to thrive as an outlet with far more submissions than a single-track conference can accept.

2005—Sabbatical at MIT

I was fortunate to spend the 2005–2006 academic year at MIT. This was soon after Mike had started Vertica to build a scalable data warehousing platform based on a column store paradigm. While the idea of storing tables in a column-oriented format dates back to Raymond Lorie’s XRM project at IBM and had been explored for use in main memory database systems by the MonetDB project at CWI, Vertica was the first parallel database system that used tables stored as columns exclusively in order to dramatically improve the performance of decision-support queries operating against large data warehouses. It was an amazing opportunity to have a close-up look at Mike launching and running one of his numerous startups. While Vertica has had modest commercial success, it has had a huge impact on the DBMS field. Every major database platform today either uses a columnar layout exclusively or offers a columnar layout as a storage option. ORC and Parquet, the two major HDFS file formats for “big data,” also both use columnar storage layout.
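The performance argument for the column store can be illustrated with a toy example (invented data, not Vertica's storage format): a decision-support aggregate over one attribute must step over every field of every tuple in a row store, but touches only a single array in a column store.

```python
# Toy contrast between row and column layouts for the same table.

rows = [  # row store: one tuple per record
    ("widget", "east", 120),
    ("gadget", "west", 340),
    ("widget", "west", 275),
]

columns = {  # column store: one array per attribute
    "product": ["widget", "gadget", "widget"],
    "region":  ["east", "west", "west"],
    "sales":   [120, 340, 275],
}

# SELECT SUM(sales): the row store must read product and region too.
row_total = sum(r[2] for r in rows)

# The column store reads the sales array alone; as a bonus, each array
# holds values of a single type, which compresses far better.
col_total = sum(columns["sales"])

assert row_total == col_total == 735
```

At warehouse scale the difference is I/O, not CPU: a query touching 3 of 300 columns reads roughly 1% of the data from a column store.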

2008—We Blog about “MapReduce”

In 2007, the CS field was abuzz about MapReduce, Google’s paradigm for processing large quantities of data. Frankly, we were amazed about the hype it was generating and decided to write a blog post for the Vertica website with our reactions [DeWitt and Stonebraker 2008]. The key point we were trying to make was that while the fault-tolerant aspects of MapReduce were novel, the basic processing paradigm had been in use by parallel database systems for more than 25 years. We also argued that abandoning the power of a declarative language like SQL, however flawed the language might be, for a procedural approach to querying was a really bad idea. The blog post would probably not have attracted much attention except that it got “slashdotted.” The reaction of the non-database community was incredibly hostile. “Idiots” was one of the kinder adjectives applied to us.
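The point can be made concrete with a toy word count, the canonical MapReduce example. This is an assumed illustration, not Google's or Vertica's code: the declarative form and the hand-written map/shuffle/reduce phases compute the same group-by aggregate.

```python
# The SQL aggregate
#     SELECT word, COUNT(*) FROM docs GROUP BY word
# and a MapReduce word count compute the same thing. A parallel database
# plans the partitioned group-by automatically; MapReduce makes the
# programmer spell out the map, shuffle, and reduce phases by hand.

from collections import defaultdict

def map_phase(doc):
    return [(word, 1) for word in doc.split()]

def shuffle(pairs):
    # Group intermediate (key, value) pairs by key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    return key, sum(values)

docs = ["to be or not to be", "be here now"]
intermediate = [pair for doc in docs for pair in map_phase(doc)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(intermediate).items())
```

The procedural version fixes one physical plan; the declarative version leaves the optimizer free to choose a better one, which was the heart of the blog's argument.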

It is interesting to reflect on this blog post ten years later. While perhaps a few hardcore hackers use MR today, every major big data platform (Hive, Presto, Impala, Cloudera, Red Shift, Spark, Google BigQuery, Microsoft SQL DW, Cosmos/Scope, etc.) all use SQL as their interface. While early versions of Hive used MapReduce as its execution engine, Tez, the current Hive executor, uses a vectorized executor in combination with standard parallel query mechanisms first developed as part of the Gamma project in the early 1980s.

Where are the MR fan boys today? They owe us an apology.

2014—Finally, a Turing Award

In 1998, I led the effort to nominate Jim Gray for a Turing Award. When I started that effort, I spent a fair amount of time debating with myself whether to nominate Jim or Mike first. By that time, both had had a huge impact on our field and it was not at all obvious to me who should be nominated first (System R and Ingres had been jointly recognized with the ACM Software Systems Award in 1988). While Mike’s publication record was clearly broader, Jim’s pioneering work on transactions [Eswaran et al. 1976] seemed to me to be ideal to base a Turing Award nomination on. I still recall thinking that a successful nomination for Mike would follow shortly after Jim. Soon after Jim received the Turing Award, he was appointed to join the Turing Award selection committee and, while serving, felt it inappropriate to support a nomination for Mike. He kept promising that once he was off the committee he would strongly support a nomination for Mike but unfortunately was lost at sea in January 2007 before that came to pass. It is sad that the Turing Award process is so political that it took 16 years to recognize Mike’s major contributions to both the research and industrial database communities. It was never my intention for it to turn out this way. Had I known in 1998 what I know now, I probably would have done things differently.

2016—I Land at MIT

Almost 50 years since we first met at Michigan, Mike and I again find ourselves at the same place, this time at MIT. While we are both nearing the end of our careers (well, at least I am) our hope is that this will be an opportunity to do one last project together.

2017

At every important juncture of my career, Mike was there to give me assistance and advice. I owe him a huge debt of gratitude. But I also owe thanks to whoever scheduled Mike to be the TA for that introductory computer architecture class in the Fall of 1970. That chance encounter with Mike turned out to have a profound impact on both my career and my life.

1. http://highgate.comm.sfu.ca/pups/4BSD/Distributions/4.2BSD/ingres.tar.gz

PART V

STARTUPS

7

How to Start a Company in Five (Not So) Easy Steps

Michael Stonebraker

Introduction

This chapter describes the methodology I have used in starting nine venture capital-backed companies. It assumes the following.

(a)  I have experience in starting system software companies. The procedure is quite different for hardware companies and applications such as those in biotech. Nothing herein applies outside of system software.

(b)  It occurs in a university context, whereby one is trying to commercialize an idea from a university research project.

(c)  One requires capital from the venture capital (VC) community. I have no experience with angel investors. I am not a fan of self-funded startups. Unless you are independently wealthy, you will need a day job, and your startup will be nights and weekends only. In system software, a part-time effort is really difficult because of the amount of code that typically needs to be written.

This chapter is divided into five steps, to be performed in order. At the end, there are some miscellaneous comments. Along with each step, I present, as an example, how we accomplished the step in the formation of my latest company, Tamr, Inc.

Step 1: Have a Good Idea

The first step in starting a company is to have a good idea. A good idea is specific—in other words, it is capable of prototype implementation, without further specification. An idea like “I would like to do something in medical wearables” is not specific enough. At MIT, where I hang out, good ideas are presented in many of the faculty lunches. Hence, my advice is to situate yourself at a research institution like CSAIL/MIT or CS/Berkeley or similar. Good ideas seem to flow out of such environments.

So what do you do if you are not in an idea-fertile place? The answer is: travel! For example, one of my collaborators is a CS professor and researcher at another university. However, he spends a considerable amount of time at CSAIL, where he interacts with the faculty, students, and postdocs at MIT. Most universities welcome such cross-fertilization. If you are in the hinterlands, then find a fertile place and spend airplane tickets to hang out there.

How does one decide if an idea is worthy of commercialization? The answer is “shoe leather.” Talk to prospective users and get their reaction to your proposal. Use this feedback to refine your ideas. The earlier you can reinforce or discard an idea, the better off you are because you can focus your energy on ideas that have merit.

If you come up with an “idea empty hole,” then consider joining somebody else who has a good idea. Startups are always a team effort, and joining somebody else’s team is a perfectly reasonable thing to do.

Once you have a good idea, you are ready to advance to step 2.

In the case of Tamr, Joey Hellerstein of UC Berkeley was spending a sabbatical at Harvard, and we started meeting to discuss possible projects. We quickly decided we wanted to explore data integration. In a previous company (Goby), I had encountered the issue of data integration of very large amounts of data, a topic that arises in large enterprises that want to combine data from multiple business units, which typically do not obey any particular standards concerning naming, data formatting, techniques for data cleaning, etc. Hence, Joey and I quickly homed in on this. Furthermore, MIT was setting up a collaboration with the Qatar Computing Research Institute (QCRI). Ihab Ilyas and George Beskales from QCRI started collaborating with us. Goby’s major problem was deduplicating data records from multiple sources that represented the same entity, for example deduplicating the various public data sources with information about Mike Stonebraker. The QCRI team focused on this area, leaving MIT to work on schema matching. Last, Stan Zdonik (from Brown) and Mitch Cherniack (from Brandeis) were working with Alex Pagan (an MIT graduate student) on expert sourcing (i.e., crowdsourcing, but applied inside the enterprise and assuming levels of expertise). They agreed to apply their model to the Goby data. Now we had the main ideas and could focus on building a prototype.

Step 2: Assemble a Team and Build a Prototype

By now, you hopefully have a few friends who have agreed to join you in your endeavor. If not, then recruit them. Where should you look? At the idea factory where you hang out! If you cannot find competent programmers to join your team, then you should question whether you have a good idea or not.

This initial team should divide up the effort to build a prototype. This effort should be scoped to take no more than three months of your team’s effort. If it takes more than three months, then revise the scope of the effort to make your prototype simpler. It is perfectly OK to hard-code functionality. For example, the C-Store prototype that turned into Vertica ran exactly seven queries, whose execution plans were hard-coded!

In other words, your prototype does not need to do much. However, make sure there is a graphical user interface (GUI); command line interfaces will make the eyes of the VCs glaze over.

VCs need to see something that demonstrates your idea. Remember that they are business people and not technologists. Your prototype should be simple and crisp and take no more than five minutes to demonstrate the idea. Think “Shark Tank,” not a computer science graduate class.

Your prototype will almost certainly be total throwaway code, so don’t worry about making the code clean and maintainable. The objective is to get a simple demo running as quickly as possible.

In the case of Tamr, we merely needed to whack out the code for the ideas discussed above, which was a team effort between QCRI and MIT. Along the way, we found two more use cases.

First, Novartis had been trying to integrate the electronic lab notebooks for about 10,000 bench scientists. In effect, they wanted to integrate 10,000 spreadsheets and had been trying various techniques over the previous 3 years. They were happy to make their data structures available for us to work on. This gave us a schema integration use case. Last, through the MIT Industrial Liaison Program (ILP), we got in touch with Verisk Health. They were integrating insurance claim data from 30-plus sources. They had a major entity consolidation problem, in that they wanted to aggregate claims data by unique doctor. However, there was substantial ambiguity: two doctors with the same last name at the same address could be a father-and-son practice or a data error. We had another, “in the wild,” entity consolidation problem. In Data Tamer, we did not focus on repair, only on detection. Verisk Health had a human-in-the-loop to correct these cases.

Our prototype worked better than the Goby handcrafted code and equaled the results from the automatic matching of a professional data integration service on Verisk Health data. Last, the prototype appeared to offer a promising approach to the Novartis data. Hence, we had a prototype and three use cases for which it appeared to work.

With a prototype, you can move to step 3.

Step 3: Find a Lighthouse Customer

The first question a VC will ask you is, “Who is your customer?” It helps a lot to have an answer. You should go find a couple of enterprises that will say, “If you build this for real, then I will consider buying it.” Such lighthouse customers must be real, that is, they cannot be your mother-in-law. VCs will ask to talk to them, to make sure that they see the same value proposition that you do.

What should you do if you can’t find a lighthouse customer? One answer is to “try harder.” Another answer is to start questioning whether you have a good idea. If nobody wants your idea, then, by definition, it is not a good idea.

If you lack contacts with business folks, then try hanging out at “meetups.” There is no substitute for network, network, network. After trying hard, if you still can’t find a lighthouse customer, then you can continue to the next step of the process, but it will make things much harder.

In the case of Tamr, we had three lighthouse customers, as noted in the previous section. All were happy to talk to interested people, which is what the next step is all about.

Step 4: Recruit Adult Supervision

VCs will look askance at a team composed of 23-year-old engineers. In general, VCs will want some business acumen on your team. In other words, they will want somebody who has “done it before.” Although there are exceptions (such as Mark Zuckerberg and Facebook), as a general rule a team must have a business development or sales executive. In other words, somebody has to be available to run the company, and the VCs will not entrust that to somebody with no experience. Although your team may have an MBA type, VCs will look askance at a 23-year-old. VCs are risk-averse and want to entrust execution to somebody who has “been around the block a few times.”

So how do you find a seasoned executive? The answer is simple: shoe leather. You will need to network extensively. Make sure you find somebody you can get along with. If the dynamics of your relationship are not terrific on Day 1, they are likely just to get worse. You will be spending a lot of time with this person, so make sure it is going to work.

Also, make sure this person has a business development or sales pedigree. You are not particularly looking for a VP/Engineering, although one of those would be nice. Instead, you are looking for someone who can construct a go-to-market strategy, write a business plan, and interact with the VCs.

What happens if you can’t find such a person? Well, do not despair; you can continue to the next step without adult supervision. However, the next step will be tougher …

In the case of Tamr, I reached out to Andy Palmer, who had been the CEO of a previous company (Vertica) that we co-founded. Andy worked at Novartis at the time and verified the importance of data integration with Novartis engineers. In addition, he invited Informatica (a dominant player in the data-integration software market) to talk about how they could solve the Novartis problem. It became clear that Novartis really wanted to solve its data integration challenge and that Informatica could not do so. Armed with that information, Andy agreed to become CEO of Tamr.1

Step 5: Prepare a Pitch Deck and Solicit the VCs

By this point in the process, you hopefully will have one or more lighthouse customers and a team that includes software developers and at least one “adult.” You are now ready to pitch the VCs. Your adult should be in charge of the presentation; however, you should remember that VCs have a short attention span—no more than 30 minutes. Your deck should not have more than 15 slides, and your pitch should include a 5-minute demo.

How do you find VCs to pitch? Ask around: in other words, network. In the hotspots (Silicon Valley, Seattle, Boston, New York, etc.), VCs are literally on every street corner. If you are in the hinterlands, then you should consider moving to a hotspot. Moreover, it is unlikely that you can find good adult supervision in the hinterlands.

Check out the reputation of any VC with whom you interact. Each VC has a reputation that precedes him or her, good or bad. Run away from anybody who does not have a stellar reputation for fairness. I have heard enough horror stories of entrepreneurs getting taken advantage of by VCs or by CEOs brought in by the VCs, and I have had one painful experience myself in this area. My strong advice: If it doesn’t feel right, you should run the other way.

Now, try out your pitch on a couple of “friendly” VCs. They will trash your pitch for sure, and you can now do the first of what will probably be several iterations. After a while your pitch deck will get better. Expect to pitch several-to-many VCs and to spend months on this process.

Every VC will give you one of three reactions.

• “I will get back to you.” This is code for, “I am not interested.”

•  “I am not interested, because [insert some usually irrelevant reason].” This is a slightly more genuine way of saying, “I am not interested.”

•  A “rock fetch.” The VC will ask you to try out your pitch on:

■  possible executives (to beef up step 4)

■  possible customers (to beef up step 3)

■  their friends—to help them evaluate your proposal

■  companies in their portfolios—again to help them evaluate your proposal

In each case, the VC is gathering information on the validity of your business proposal. You should expect several-to-many rock fetches and the process to go on for weeks. Although rock fetches are very frustrating, you really have only two choices.

1.  Do the i+1st rock fetch.2

2.  Tell the VC you are not interested.

In general, you may have to turn over a lot of rocks to find a diamond. Enjoy the process as best you can. Hopefully, this ends with a VC saying he or she wants to give you a term sheet (the details of the proposed investment). If this happens, then there may well be a “pile on.” Previous VCs who were not interested may suddenly become very interested. In other words, there is a “follow the crowd” mentality. Unfortunately, this will not usually result in the price of the deal rising; there will just be more people around the table to split the deal. The above statements do not apply to “unicorns” (companies like Facebook or Uber), which have valuations in the stratosphere. The rest of us are stuck with down-to-earth valuations.

When you receive a term sheet, it is crucial that you find somebody who has “done it before” to help you negotiate the deal. The non-financial terms are probably more important than the financial ones, so pay close attention.

The most onerous term is “liquidation preference.” VCs will routinely demand that they get their money back in any liquidity event before the common stockholders get anything. This is known as a 1× preference. However, I have seen term sheets that propose a 3× preference. Suppose you accept $20M of venture capital over multiple financing rounds. With a 3× preference, the VCs get the first $60M before the founders and employees get anything. If the VCs have 60% of the stock and a 3× preference, then a liquidity event of $80M will mean the VCs get $72M and others get $8M. As you can see, this is not a terrific outcome. Hence, pay careful attention to this and similar terms.
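
The preference arithmetic in this example can be sketched in a few lines of Python. This is an illustrative model only: the function name is mine, and it assumes a fully participating preferred (the VCs take their preference off the top and then also share the remainder pro rata), which is what the $72M/$8M split above implies.

```python
def payout(exit_value, invested, multiple, vc_ownership):
    """Split an exit between preferred (VC) and common holders,
    assuming a participating preferred with the given multiple."""
    # The VCs first take their preference off the top (capped at the exit value)...
    preference = min(multiple * invested, exit_value)
    # ...then the remainder is split pro rata by stock ownership.
    remainder = exit_value - preference
    vc_total = preference + vc_ownership * remainder
    common_total = remainder - vc_ownership * remainder
    return vc_total, common_total

# The $80M liquidity event from the text: $20M raised, 3x preference, 60% VC stock.
vc, common = payout(exit_value=80e6, invested=20e6, multiple=3, vc_ownership=0.60)
print(f"VCs: ${vc / 1e6:.0f}M, everyone else: ${common / 1e6:.0f}M")  # VCs: $72M, everyone else: $8M
```

With a 1× preference instead, the same exit would split $56M to the VCs and $24M to the common holders, which is why this term deserves such close attention.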

The other thing about negotiating with the VCs is that they do deals on a regular basis and you don’t. It is reminiscent of the automobile market before the Internet. VCs will throw around statements like “this salary is not market” or “this stock position is not market.” It is difficult to argue since they have all the data and you don’t. My only suggestion is to get somebody who has done it before to help with the negotiation, and to keep you out of all the assorted sand traps that you would otherwise encounter.

Expect the negotiation with the VC (or VCs) to be a back-and-forth process. My advice is to settle all the non-financial terms first. Second, make sure you have the amount that you want to raise. In general, this should be enough money to get your product into production use at one of your lighthouse customers. Don’t forget that quality assurance (QA) and documentation must be included. When you think you have a number, then double it, because entrepreneurs are notoriously optimistic. Then, the negotiation boils down to a raise of $X in exchange for Y% of the company. In addition, the remainder must be split between the founders and an option pool for other employees to be hired in the first year. Hence, the only things to be negotiated are Y and the size of the option pool. My advice is to fix essentially everything and then negotiate the price of the deal (Y above).
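
The split described above can be sketched the same way. The numbers here ($10M raised for 40% of the company, a 15% option pool) are hypothetical, chosen only to show how Y and the pool size determine what is left for the founders.

```python
def cap_table(raise_amount, vc_pct, option_pool_pct):
    """Post-money ownership implied by raising $X for Y% of the company."""
    post_money = raise_amount / vc_pct              # implied post-money valuation
    founders_pct = 1.0 - vc_pct - option_pool_pct   # whatever is not VC or option pool
    return {"post_money": post_money, "vcs": vc_pct,
            "option_pool": option_pool_pct, "founders": founders_pct}

# Hypothetical round: raise $10M for 40% of the company, with a 15% option pool.
deal = cap_table(raise_amount=10e6, vc_pct=0.40, option_pool_pct=0.15)
print(f"Post-money: ${deal['post_money'] / 1e6:.0f}M, founders keep {deal['founders']:.0%}")
```

Doubling the amount you think you need, as advised above, raises X but not necessarily Y in proportion; that trade-off is exactly what gets negotiated.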

In the case of Tamr, Andy and I pitched a friendly VC from New Enterprise Associates, who took the deal on the spot, as long as the terms were within what he could accept without asking his partners. He had the advantage that he knew the data integration space and the importance of what we were trying to do. Armed with his acceptance, we found it relatively easy to recruit another VC to supply the other half of the capital we needed.

Assuming you can come to terms with the VC (or VCs), you are off to the races. In the next section, I make a collection of random comments on topics not discussed above.

Comments

Spend No Money

Companies fail when they run out of money, and in no other way. Do not spend your capital on Class A office space, a receptionist, office furniture (buy from Ikea if you can’t scrounge it from somewhere else), a car service, or somebody to make travel arrangements. Also, do not hire a salesperson—that is the job of the adult on your team. Your team should be as lean as possible! The adage is “kiss every nickel.”

Intellectual Property

I am routinely asked, “What about intellectual property?” I am a huge fan of open-source software. If you did development at a university, then declare your efforts open source, and make a clean “toss the code over the wall” into your company. Otherwise, you are stuck negotiating with the Technology Licensing Office (TLO) of your university. I have rarely seen this negotiation go well. Your company can be open or closed source, and can submit patent applications on work done after the wall toss. I am also a fan of reserving a block of stock for your university, so they get a windfall if your company does well. Also, dictate the stock go to Computer Science, not to the general university coffers.

Your VCs will encourage you to submit patent applications. This is totally so you can say “patented XXX” to prospective customers, who think this means something. It costs about $25K to submit an application and a month of your time. Go along with the VCs on this one.

I have rarely seen a software patent stand up to scrutiny. There is always prior art or the company you are suing for patent infringement is doing something a bit different. In any case, startups never initiate such suits. It costs too much money. In my experience, software patents are instead used by big companies for business advantage.

In the case of Vertica, a large company against which we were competing (and routinely winning) sued us for patent infringement of one of their old patents. In round numbers, it cost us $1M to defend ourselves and it cost them $1M to push the suit. Their $1M was chump change since they were a large company; our $1M was incredibly valuable capital. Moreover, they told all of our prospects, “Don’t buy Vertica, since they are being sued.” Although we ultimately won, they certainly slowed us down and distracted management. In my opinion, the patent process is routinely abused and desperately needs a massive overhaul.

First Five Customers

It is the job of your adult to close the first five customers. Do not worry about getting lots of money from them. Instead, you just need them to say nice things to other prospective customers about your product.

Raising More Money

The general adage is “raise more money only when you don’t need it.” In other words, whenever you pass an oasis, stock up on water. In general, you can get the best price for your stock when you don’t need the money. If you can’t get a reasonable deal, then turn down the deal. If you are in danger of running out of money (say within six months of “cash out”), the VCs will hammer you on price or string you along until you are desperate and then hammer you on price. You can avoid this by raising money well before you run out.

Two VCs

In my opinion, if you have a single VC funding your company, you get a boss. If you have two, you get a Board of Directors. I much prefer two, if you can swing it.

Company Control

Unless you are a unicorn, the VCs will control the company. They invariably make sure they can outvote you if push comes to shove. Hence, get used to the fact that you serve at their pleasure. Again, I can’t overemphasize the importance of having a VC you can work with. Also, fundamentally your adult is running the company, so the VCs are actually backing that person. If you have a falling out with your adult, you will get thrown under the bus.

The worst problems will occur if (or when) your adult tires of the grind of running a startup. If he or she exits, then you and the VCs will recruit a new CEO. My experience is that this does not always go well. In two cases, the VCs essentially insisted on a new CEO with whom I did not get along. In both cases, this resulted in my departure from the company.

Secrecy

Some companies are very secretive about how their products work. In my opinion, this is usually counterproductive. Your company wins by innovating faster than the competition. If you ever fail to do this, you are toast. In my opinion, you should fear other startups, who are generally super smart and don’t need to learn from you. Large companies typically move slowly, and your competitive advantage is moving more quickly than they do. In my opinion, secrecy is rarely a good idea, because it doesn’t help your competitive position and keeps the trade press from writing about you.

Sales

It never ceases to gall me that the highest-priced executive in any startup is invariably the VP of Sales. Moreover, he or she is rarely a competent technologist, so you have to pair this person with a technical sales engineer (the so-called four-legged sales team). The skill this person brings to the table is an ability to read the political tea leaves in the companies of potential customers and to get customer personnel to like them. That is why they make the big bucks!

The most prevalent mistake I see entrepreneurs make is to build out a sales organization too quickly. Hire sales people only when your adult is completely overloaded and hire them very, very slowly. It is rare for a sales person to make quota in Year One, so the expense of carrying salespeople who are not delivering is dramatic.

Other Mistakes

There are two other mistakes I would like to mention.

First, entrepreneurs often underestimate how hard it is to get stuff done. Hence, they are often overoptimistic about how long it will take to get something done. As I noted above, double the amount of money you request as a way to mitigate this phenomenon. Also, system software (where I have spent all of my career, essentially) is notoriously hard to debug and make “production-ready.” As a result, startups often run out of money before they manage to produce a saleable product. This is usually catastrophic. In the best case, you need to raise more money under very adverse circumstances.

The second mistake is trying to sell the product before it is ready. Customers will almost always throw out a product that does not work well. Hence, you spend effort in the sales process, and the net result is an unhappy customer!

Summary

Although doing a startup is nerve-racking, requires a lot of shoe leather and a lot of work, and has periods of extreme frustration, I have found it to be about the most rewarding thing I have done. You get to see your ideas commercialized, get to try your hand at everything from sales to recruiting executives, and get to see the joy of sales and the agony of sales failures firsthand. It is a very broadening experience and a great change from writing boring research papers for conferences.

1. See Chapter 21 for the story of this system.

2. Where “i” is some big number and “+1” is yet another. In other words, a LOT of rock fetches.

8

How to Create and Run a Stonebraker Startup—The Real Story

Andy Palmer1

If you have aspirations to start a systems software company, then you should consider using the previous chapter as a guide.

In “How to Start a Company in Five (Not So) Easy Steps” (see Chapter 7), Michael Stonebraker distilled the wisdom gained from founding nine (so far) database startups over the last 40 years.

As the “designated adult supervision” (business co-founder and CEO) in two Stonebraker database startups (Vertica Systems and Tamr) and a founding BoD member/advisor to three others (VoltDB, Paradigm4,2 Goby), I have lived and developed this approach with Mike through more than a dozen years marked by challenging economic climates and sweeping changes in business, technology, and society.

I am privileged to have had the opportunity to partner with Mike on so many projects. Our relationship has had a profound effect on me as a founder, as a software entrepreneur, and as a person.

In this chapter, I’ll try to answer the question: “What’s it like running a company with Mike Stonebraker?”

Figure 8.1  Andy Palmer and Mike Stonebraker at the 2014 Turing Award Ceremony, June 20, 2015. Photo credit: Amy Palmer.

An Extraordinary Achievement. An Extraordinary Contribution.

Founding nine startup companies is an extraordinary achievement for a computer scientist. As the 2014 ACM Turing Award citation noted:

“Stonebraker is the only Turing award winner to have engaged in serial entrepreneurship on anything like this scale, giving him a distinctive perspective on the academic world. The connection of theory to practice has often been controversial in database research, despite the foundational contribution of mathematical logic to modern database management systems.”

All of these companies were at the intersection of the academic and commercial worlds: Mike’s tried-and-true formula [Palmer 2015b]. They involved convincing businesses across industries to provide their large working datasets—an achievement in itself—to help turn academic research and theory into breakthrough software that would work at scale.

Mike broke new ground not just in starting these companies, but also in how he has started them. Mike’s methods are uniquely suited to bringing the best academic ideas into practice.

He focused on solving big “unsolvable” real-life problems versus making incremental improvements on current solutions. Incrementalism has all too often been the status quo among database software vendors, to the detriment of business and industry. A great example of how incrementalism plagued commercial database systems was the perpetuation of “row-oriented” database systems and incremental enhancements, such as materialized views, in the 1980s and 1990s. It took Mike’s “One size does not fit all in database systems” paper [Stonebraker and Çetintemel 2005] to help shake things up and inspire the proliferation of innovation in database systems, including Vertica’s column-oriented analytics database (acquired by HP (Hewlett-Packard) and now part of Micro Focus’ software portfolio). The sheer number and variety of purpose-oriented databases on the market today is a testament to Mike’s reluctance to settle for incrementalism.

He takes a disciplined, engineering-driven approach to systems software design and development. Mike always had a clear understanding of the differences between an academic code line, startup first release code, startup second release code, and code that would actually work at scale very broadly. He recognized how hard it was to build something rock solid, reliable, and scalable. He was able to suspend much of his academic hubris when it came time to build commercial product, respecting the requirements and costs of building real systems. He also recruited great people who had the patience and discipline to spend the years of engineering required to build a great system (even if he didn’t have the patience himself). Mike is notorious for referring to SMOC, or “a Simple Matter of Code.” But he always knew that it took many person-years (or decades) to write the code after working out the core design/algos—and how to achieve this optimally through strategic hiring and sticking to the right approach.

He takes a partnership approach to leadership. During the time I’ve worked with him, Mike has embraced a partnership approach, believing that success depends on both great technology and great business/sales—skills that don’t typically coexist in one leader. I’ve always found him to actively want to be a partner, to work together to build something great. He believed that, through partnership and the open exchange of strong opinions, partners can get to the best answers for the project or company. Mike and I love to argue with each other. It’s fun to go at a hard problem together. It doesn’t matter if it’s a business problem or a technical problem, we love to hash it out. I think that each of us secretly likes talking about the other’s area of expertise.

Mike’s track record in starting new companies is better than most venture capitalists’. Of his startups to date, three have delivered greater than 5× returns to their investors (Ingres, Illustra/Postgres, and Vertica), three delivered less, and three are still “in flight.” Blue-chip and leading technology companies—such as Facebook [Palmer 2013], Uber, Google, Microsoft, and IBM—as well as cutting-edge mainstream industrial companies—such as GE and Thomson Reuters—use products from Stonebraker companies/projects. Stonebraker-inspired products have been in use for every second of every day across the world for nearly three decades.3

For these reasons, Mike’s footprint extends far beyond the companies he founded, the students he taught, and the academic collaborators he inspired. The technology industry and the business world have learned a lot from him.

A Problem of Mutual Interest A Happy Discovery

Mike and I were introduced in 2004 by Jo Tango,4 a venture capitalist who was at Highland Capital Partners at the time. We were at a Highland event at the Greenbrier resort in West Virginia. Our wives met first and hit it off: so much so that they told us that we had to start a company together. Our meeting surfaced a problem of mutual interest, followed by a happy discovery that we really got along and saw startup stuff the same way. (Good thing for the preservation of marital bliss.)

I was then chief information and administrative officer for Infinity Pharmaceuticals, and we had just finished trying to implement a large data warehouse using Oracle RAC (an effort that essentially failed). Mike was working on the C-Store project at MIT [Stonebraker et al. 2005a], a scale-out column-oriented database. I had experienced firsthand the problems that Mike was trying to address with C-Store and was starting to think about doing another startup.

What really sealed the deal, however, is that Mike and I shared the same core values, particularly the partnership approach to founding a company. We both believed that the best companies are often made through partnerships between people who appreciate each other, have a sense of humility in their own areas, and look to others to help them make the best decisions to achieve the best outcomes. We founded Vertica in 2005 (see Figure 8.2).

Figure 8.2  In founding Vertica, we asked ourselves several questions.

The Power of Partnership

Our partnership has worked over the years because our principles and core values are aligned. Our approach to starting companies is based in pragmatism and empirics, including:

1.  a focus on great engineering, first and foremost;

2.  build products designed to solve real customer problems [Palmer 2013];

3.  deliver thoughtful system architecture: work in the “white space” instead of incrementally reinventing the proverbial wheel;

4.  hire the best people (including “looking beyond the resume”) and treat people respectfully (check titles and degrees at the door);

5.  test all assumptions with many real customers, early and often;

6.  take a capital-efficient approach, even in “flush” times, and consider the full lifecycle when seeking initial capital (think beyond the first round/next milestone); and

7.  have fun in the process.

Vertica’s success derived from our adherence to those core values and the hard work of many people over many years—most notably Colin Mahony, who still leads Vertica at Micro Focus; Shilpa Lawande, who led engineering for 10-plus years; and Chuck Bear, who was the core architect.

Many of Vertica’s early prospective customers had expressed interest in a specific feature (materialized views). This was a classic example of customers asking for things that aren’t in their own long-term best interests. Mike said that there was no way we were going to build materialized views; his exact words were “over my dead body”—another Mike-ism. A startup with a top-down, sales-oriented CEO would have probably just directed engineering to build materialized views as requested by customers. Instead, we had a healthy debate about it, but decided not to do it, which turned out to be absolutely the right thing. Another way to view this is that Vertica’s entire system is a collection of materialized views—what we called “projections.” And over time, our customers came to appreciate the fact that they didn’t need to implement materialized views: they just needed to use Vertica.
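
The “projections, not materialized views” point can be sketched abstractly (hypothetical Python, not Vertica’s actual implementation): each projection is just another sorted, possibly column-subset copy of the base data, and a planner simply picks whichever copy best serves a query, so there is no separate view layer to define and refresh.

```python
# Hypothetical sketch: one logical table kept as multiple "projections" --
# differently sorted copies -- instead of a base table plus materialized views.
data = [
    ("2024-01-02", "EMEA", 100.0),
    ("2024-01-01", "APAC", 250.0),
    ("2024-01-03", "EMEA", 175.0),
]

# Projection A: sorted by date (good for time-range scans).
proj_by_date = sorted(data, key=lambda t: t[0])

# Projection B: sorted by region (good for per-region rollups).
proj_by_region = sorted(data, key=lambda t: t[1])

def total_for_region(region):
    # A toy "planner": choose the projection sorted on the predicate
    # column, then aggregate over the matching run of tuples.
    return sum(sales for (_, r, sales) in proj_by_region if r == region)

assert total_for_region("EMEA") == 275.0
```

The date and region column names, like the planner itself, are invented for illustration; the point is that every query is answered from some projection, so no query ever needs a hand-declared materialized view.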

Another example was pricing: I believed that we needed an innovative pricing model to differentiate our product. Instead of the conventional pricing model (which was then based on per-server or per-CPU), we went with a pricing model tied to the amount of data that people loaded into the system (per-terabyte). (Credit goes to product marketing pro Andy Ellicott for this idea.) This was counterintuitive at the time. Mike probably would have said, “Let’s choose something that’s easy for the customers.” But I was really confident that we could pull this off and that it was the right thing to do, and we talked it through. Mike supported the idea and in the end it was right. Much of the analytical database market subsequently migrated toward the per-terabyte pricing model.
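
The two models differ in what the customer’s bill tracks. A back-of-the-envelope comparison (all prices and cluster sizes invented for illustration) shows why per-terabyte pricing suits a scale-out system: adding commodity nodes for speed doesn’t raise the bill.

```python
# Invented example numbers: per-CPU licensing vs. per-terabyte pricing.
PRICE_PER_CPU = 40_000.0  # hypothetical license fee per CPU
PRICE_PER_TB = 20_000.0   # hypothetical fee per terabyte loaded

def per_cpu_cost(servers, cpus_per_server):
    return servers * cpus_per_server * PRICE_PER_CPU

def per_tb_cost(terabytes_loaded):
    return terabytes_loaded * PRICE_PER_TB

# A scale-out cluster on commodity hardware: many CPUs, modest data.
# Per-CPU pricing penalizes adding nodes; per-TB pricing does not.
print(per_cpu_cost(servers=10, cpus_per_server=8))  # 3200000.0
print(per_tb_cost(terabytes_loaded=5))              # 100000.0
```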

The partnership model of entrepreneurship5 takes work, but it’s proven well worth it in the companies that we’ve co-founded.

After finding the right partner, it’s vital to have a simple but firm partnership agreement, including: (1) a mutual understanding of the entire journey to build the company; (2) a willingness to challenge assumptions at each stage of its development; (3) a mutual appreciation and respect for both the technical and business sides of the company to prevent finger-pointing; and (4) explicit alignment on core values. You should accept and plan for the overhead associated with effective shared leadership, including the time for healthy debates and navigating disagreements. Mike and I have gotten into some pretty heated discussions over the years, all of which made for better companies and an ever-stronger relationship.

Great co-founder relationships are more flexible—and more fun. Startups are inherently hard: a partner can help you through the tough times, and it’s definitely more fun to celebrate with someone during the good times. Startups are also constantly changing, but with a complementary co-founder, it’s much easier to roll with the changes. It’s like a built-in support group.

At the same time, Mike and I aren’t joined at the hip. We both have lots of interests and are always looking at lots of different things in our areas of interest. I have extracurricular interests in a whole bunch of different businesses, and Mike has interests in all kinds of different projects and technology. This brings invaluable outside perspective into our relationship and the projects we work on together. This is why—coupled with our deep trust in each other—we can do many different separate projects but always come back together and pick up exactly where we left off on our joint projects.

We’re alike in that we both want to get lots of data and we appreciate external feedback. We’re open to better ideas, regardless of where they come from.

Shared values and deep trust in each other saves a lot of time. Early in the history of Vertica, we were out raising money. Our financing terms were aggressive, but we were confident that we were on to something. We were in the offices of a large, well-known venture capital firm. There was a junior venture partner who wanted us to commit to certain revenue and growth numbers, but it was really early in the lifecycle of the company. Mike and I glanced at each other and then said, “No, we’re not going to make commitments to numbers that are pure fiction. If you don’t like it, then don’t invest. We’ll go right across Sand Hill Road and take money from somebody else.” That’s exactly what we did.

Fierce Pragmatism, Unwavering Clarity, Boundless Energy

There’s a fierce pragmatism to Mike, which I share and which permeates our projects. While we like to work on really sophisticated, complex, technical problems, we pair that with a very pragmatic approach to how we start and run these companies. For example, from the beginning of Vertica, we agreed that our system had to be at least 10× faster and 50-plus% cheaper than the alternatives. If at any point in the project we hadn’t been able to deliver on that, we would have shut it down. As it turned out, the Vertica system was 100× faster, a credit to the engineering team at Vertica. (You guys rock.)

Mike also sees things in a very clear way. For example: It’s relatively easy for companies to go out and build new technology, but that doesn’t mean they should. When Mike famously criticized MapReduce back in the late 2000s [DeWitt and Stonebraker 2008, Stonebraker et al. 2010], it was controversial. But at that point, there was a whole generation of engineers rebuilding things that had already been worked out either academically or commercially: they just had to do a bit more research. It’s frustrating to watch people make the same mistakes over and over again. Today, because of Mike, there are a whole bunch of people doing things commercially who are not making the same mistakes twice. I always tell young engineers starting out in databases to read the Big Red Book (Readings in Database Systems, 5th ed.).

Mike’s energy—physical and intellectual—is boundless. He’s always pushing me hard, and I’m still struggling to keep up even though I’m 20 years younger. This energy, clarity, and pragmatism infuses the principles of our businesses at every level. Here are some more examples.

Our #1 principle (see list) is “focus on great engineering.” This doesn’t mean hiring a seasoned engineering VP from a big company who hasn’t written code in decades. It starts and ends with real engineers who understand why things work and how they should work in great systems. We like hiring engineering leaders who want to write code first and manage second, and we don’t get hung up on titles or resumes. (If they are reluctant managers, that’s a good sign; if they want to manage rather than write code, that’s usually a bad sign.)

After successfully resisting BoD pressure at Vertica to “hire a real VP of engineering,” we promoted the amazing Shilpa Lawande to run engineering. Shilpa had a more integrated view of what needed to be built at Vertica than anyone. Promoting Shilpa was a risk as it was her first leadership opportunity, but we knew she was going to kill it. She kept writing code for the core system, but eventually (when the engineering team got to critical mass), she stepped up to the plate—and killed it, as predicted, leading the entire engineering team at Vertica for over a decade. (As of August 2017, Shilpa had started her own new company, an AI healthcare startup.)

Another pivotal “Mike moment” in Vertica’s engineering-driven strategy was when Mike brought in noted computer scientist David DeWitt. Dave provided invaluable advice in system design and got right down in the trenches with our engineering team.

Another core principle is thoughtful system architecture, and Mike is our ace in the hole on this one. By knowing to the bits-on-disks level how systems work and having seen design patterns (good and bad) over and over again for the past 40-plus years, Mike knows almost instinctively how a new system should work.

Case in point: A lot of Vertica competitors building high-performance database systems believed that they needed to build their own operating systems (or at least methods to handle I/O). Early on, we made a decision to go with the Linux file system. The academic founding team led by Mike believed that Linux had evolved to be “good enough” to serve as a foundation for next-generation systems.

On the surface, it seemed like a big risk: Now, we were not only running on commodity hardware (Netezza was building its own hardware), but also running on a commodity (open source) operating system. Mike knew better: an instinctive decision for him, but the right one and highly leveraged. Had we gone the other way, the outcome of Vertica would be very different.

Fortunately, Mike talked me down from positioning Vertica as an alternative high-performance/read-oriented storage engine for MySQL. The idea was that, following InnoDB’s acquisition by Oracle,6 the MySQL community was looking for an alternative high-performance, read-oriented storage engine that would sit underneath the rest of the MySQL stack for SQL parsing, optimization, and so on. Mike said it would never work: high-performance applications wouldn’t work without control from the top of the stack (the declarative language) on down. (Right again, Mike.)

But Mike isn’t always right and he’s not afraid of being wrong. His batting average is pretty good, though. He was famously early in embracing distributed systems, basing his early, early work (1970s) on his belief. It’s taken 40-plus years for the industry to come around to these kinds of things. He was never a fan of using high-performance memory architectures, and he was wrong on that. A hallmark of Mike—and a key to his success—is that he’s never afraid to have strong opinions about things (often things he knows nothing about) just to encourage debate. Sometimes very fruitful debates.

Mike’s a great example of the belief that smart people are smart all the time. Venture capitalists and others who have tended to pigeonhole him as “just an academic” vastly underestimated him. Like MIT’s Robert Langer in biotech, one doesn’t start this many companies and have this kind of success commercially and academically without being really, really smart on many levels.

Figure 8.3  Mike Stonebraker and his bluegrass band, “Shared Nothing” (what else?), entertain at the Tamr Summer Outing on Lake Winnipesaukee in July 2015. From left are Mike, software engineer John “JR” Robinson (of Vertica), and Professor Stan Zdonik (of Brown University). Photo Credit: Janice Brown.

A Final Observation: Startups are Fundamentally about People

Companies come and go; good relationships can last forever. Partnership extends to the people who work “with” you—whether it’s the graduate students or Ph.D.s who help build early research systems or the engineers who develop and deploy commercial systems.

As a Mike Stonebraker co-founder, I believe that I work for the people who work with me, giving them (1) better career development opportunities than they could find elsewhere, by aligning their professional interests with “whatever it takes” to make the startup successful, and (2) a healthier, more productive and more fun work environment [Palmer 2015a] than they can find elsewhere.

One of the reasons Mike has such a broad and diverse “academic family” is that he invests tremendous time, energy, and effort in developing young people, giving them opportunities both academically as well as commercially. This may be his biggest gift to the world.

1. As acknowledged by Mike Stonebraker in his Turing lecture, “The land sharks are on the squawk box.” [Stonebraker 2016]

2. See Chapters 27–30 on the research and development involving C-Store/Vertica, Tamr, H-Store/VoltDB, and SciDB/Paradigm4.

3. For visual proof, see Chapter 13 by Naumann and Chart 2 of Part 1 by Pavlo.

4. Read Tango’s Chapter 9.

5. Read more about the importance of a partnership model in Chapter 7, “How to Start a Company in Five (Not So) Easy Steps” (Stonebraker), and Chapter 9, “Getting Grownups in the Room: A VC Perspective” (Tango).

6. InnoDB, which built the storage component of MySQL, had been acquired by Oracle, causing a temporary crisis in the open-source community.

9

Getting Grownups in the Room: A VC Perspective

Jo Tango1

My First Meeting

“Your website’s directions are all wrong,” Mike said. “It led me to take the wrong freeway exit.”

In the Summer of 2002 I was looking into new developments in the database space, and Sybase co-founder Bob Epstein suggested that I look up Mike Stonebraker, who had just moved from California to the Boston area.

So, I found Mike’s email and reached out, proposing a meeting. Our receptionist showed Mike to the conference room. I walked in and said, “Hello.”

I saw a tall man with penetrating eyes. A wrinkled shirt and shorts completed the ensemble.

Mike’s first words (above) made me think: “This is going to be a different meeting!”

It turned out to be the start of what has been a long working relationship across a number of companies: Goby (acquired by NAVTEQ), Paradigm4, StreamBase Systems (TIBCO), Vertica Systems (Hewlett-Packard), and VoltDB. Most importantly, Mike and I have become friends.

Context

Venture capital is an interesting business. You raise money from institutions such as college endowments and pension funds, and you then strive to find entrepreneurs to back. As an early-stage VC, I look for entrepreneurs in emerging technologies who are just forming companies that warrant seed capital.

There are two interesting facets to venture capital.

First, it is the ultimate Trust Game. Institutional investors commit to a fund that is ten years in duration. There are very few ways for them to get out of a commitment, and there is no Board of Directors, as venture firms are run as private partnerships. So, they will invest only if they trust you.

You, as the VC, in turn invest in entrepreneurs, many of whom are quirky, like Mike. You go to board meetings, you try to influence strategy, but for 99% of the time, you and the entrepreneur are not in the same room. A founder can golf every day or engage in nefarious behavior, and you’re often the last to know. So, you only invest in someone whom you trust.

Mike is someone I trust. Yes, I believe in him.

Second, you’re paid to peer into the future with all the risks and imperfection that that entails. In 2002, I spent a lot of time thinking about, “What’s next?” Through my work, I had built relationships with the CIOs and CTOs at Goldman Sachs, Morgan Stanley, Fidelity, Putnam, and MMC (Marsh & McLennan Companies).

In numerous one-on-one conversations with these industry leaders, we talked about the pressing problems facing their companies and what they would have liked to see if they could “wave the magic wand.” Based on such insights, I started some seed projects with their sponsorship, often after getting Mike’s take.

I noticed that there was a great deal going on with storage subsystems and networking technologies. But, there wasn’t much going on in the database layer. Oracle seemed to have a lock on the space.

Hence, the call to Bob Epstein. I’m grateful that Bob suggested that I contact Mike.

StreamBase

A few months after our first meeting, Mike emailed me in “characteristic Mike” fashion: “I have a new idea—wanna take a look?” This time, Mike was wearing slacks, which I found to be a good sign!

He had a hypothesis that the database world was going to splinter, that “one size doesn’t fit all.” He proposed raising a small round, a “seed” financing for a company called “Grassy Brook.” I proposed that we find a better name, to which Mike responded with a loud and hearty laugh. I love that laugh, and I’ve heard it many times since.

After doing some due diligence, Grassy Brook’s founding team moved into some free space downstairs from my office. An early engineering team assembled.

Mike and I, in a way, started to “live together.” We started to bump into each other at the office fairly often. We learned to work together.

Early in the investment, Mike suggested that we and our spouses grab dinner up near Lake Winnipesaukee, where I was renting a summer place for two weeks and where Mike had decided to buy a house on Grassy Pond Road. It was great to meet Beth, his spouse.

A bit later, Mike had the whole founding team and his board up that summer. It was great to meet Mike’s daughters, Leslie and Sandy. Leslie did the first company logo.

Some months later, I found for Mike a business-oriented CEO through my personal network.

I do remember early on some “email flame wars.” I was part of many group email discussions, and, sometimes, someone would say something with which Mike would disagree. Then, the terse emails, sometimes in all caps, would come.

At other times, Mike was suspicious of what I or the other VCs were saying or doing, and he would react strongly. But, over time, I found that Mike was very consistent. He wanted to hear the truth, pure and simple. He wanted to test people’s motivations and whether they were really saying what they meant and whether they would do what they said.

I was very comfortable with this. Mike was a lot like my father. He was a Truth Teller. And, he certainly expected others to do the same.

Now, don’t get me wrong. Being at the end of a “Mike Firehose Email” dousing can be unpleasant. But, if you felt you were right and presented data, he would be the first to change his thinking.

“Color my face red,” one email from Mike started, when he realized that he was incorrect in one of his assumptions and assertions.

Yes, a Truth Teller, one even willing to speak truth to himself and to have truth spoken to him. It was then that I respected Mike even more. It made me want to work even harder on the company, which was renamed StreamBase Systems.

A Playbook Is Set

By the time Mike started to think up Vertica, I felt he and I were in a groove. He would keep me informed of his ideas, and I would give him feedback. If I met good people, I introduced them to Mike.

You see, an early-stage VC is very much like an executive recruiter in a company’s life. It is an extremely rewarding part of the job to connect good people and watch a match happen. Key founding executives, including CEOs, joined StreamBase, Vertica, Goby, Paradigm4, and VoltDB in this way. Those are the Stonebraker companies with which it has been a pleasure to be involved as a VC.

When like-minded people with shared values partner together, such as Mike and Andy Palmer, good things tend to happen.2

So, our playbook is:

•  Mike has good ideas.

•  I invest in the ones that make me excited and make personal introductions to executives and potential customers.

•  He hosts a kickoff party at his summer house.

•  We work hard together, as partners, to do what is right for the company.

This is the embryonic version of what Mike refined over nine startups to become what he described in a previous chapter as “five (not so) easy steps” (see Chapter 7).

I’m happy to state that over time, the email flame wars have become largely nonexistent. When we have conflict, Mike and I pick up the phone and talk it out. But, it is a lot easier now, in this Trust Game, for I believe that each of us has earned the other’s trust.

Mike’s Values

Before and after board meetings and calls, and over some dinners, I talked about personal things with Mike. I learned that he was from a low-income background, but that his life changed when Princeton gave him a spot.

I learned that he and Beth were very serious about supporting good causes, and they intentionally lived quite below their means. They wanted to be careful with how their children perceived money’s role in life.

I still chuckle when Mike in the summer wears shorts to board meetings. “Mike is Mike,” I often say to others. He is one-of-a-kind—and, in a very good way.

Through many interactions, Mike affected my style, too. He appreciates people being blunt and transparent, and so, in our interactions, I no longer filter. I just tell him what I think and why. It can be a bit intimidating to go toe-to-toe with Mike on difficult issues, but I have found that he is eminently reasonable, and, if you’re armed with data, he is very open minded.

So, like a pair of old friends, there is a give and take and mutual respect. We really do work well together, and there’s much trust there.

Why does one win the Turing Award? Honestly, I do not know. But, I feel I do know that Mike has unique abilities. He is comfortable with taking risks, working with a diverse group of people, and striving for something bold and interesting. He is a true entrepreneur.

A Coda

Sam Madden3 called me up in 2014. He was throwing Mike a Festschrift—a 70th birthday celebration for a professor—at MIT. I rushed to be a personal sponsor, as did Andy Palmer, Intel, and Microsoft.

Sam did his usual job of leading the program, gathering a lineup of speakers and even handing out T-shirts in bright red.

It was a joyous event, with some great talks, many smiles, and a personal realization that I was in the midst of so many great people with large brains and sound values.

In particular, it was great to see Beth, as well as Leslie and Sandy, who had been young girls when I had met them the first time many years ago.

A Great Day

One time after a board meeting for Vertica, we somehow got to talking about the Turing Award. Being a liberal arts major, I asked: “So, what is it?”

“It is the greatest honor,” Mike said. “I’ve been up for it in the past, from what I hear, but it’s not something I like to think about. It is a crowning achievement for someone in computer science.” He suddenly became quiet and looked down at the table.

I didn’t dare ask any more questions.

Years later, when the world heard that Mike won the Turing Award, I was elated. I wasn’t able to attend the ceremony, but I read his speech online. At the end, he referred to two people, “Cue Ball” and “The Believer.” People told me that he was referring to Andy Palmer and me.

What an honor .…

That was a very kind and greatly appreciated gesture, Mike. You’ve been a great business partner and friend. You are a great role model for me as a parent. I love you and am so happy for you!

1. Acknowledged as “Believer” by Mike Stonebraker in his Turing lecture, “The land sharks are on the squawk box” [Stonebraker 2016].

2. See Mike and Andy’s views in the other chapters in this section, Chapter 7 (by Michael Stonebraker) and Chapter 8 (by Andy Palmer).

3. Read Sam Madden’s biography of Michael Stonebraker (Chapter 1) and his introduction to Mike’s research contributions by system (Chapter 14).

PART VI

DATABASE SYSTEMS RESEARCH

10

Where Good Ideas Come From and How to Exploit Them

Michael Stonebraker

Introduction

I often get the question: “Where do good ideas come from?” The simple answer is “I don’t know.”

In my case, they certainly don’t come from lying on the beach or communing with nature on a mountaintop. As near as I can tell, there are two catalysts. The first is hanging around institutions that have a lot of smart, combative people. They have ideas that they want you to critique, and they are available to critique your ideas. Out of this back-and-forth, good ideas sometimes emerge. The second catalyst is talking to a lot of real-world users of DBMS technology. They are happy to tell you what they like and don’t like and the problems that they are losing sleep over. Out of talking to a lot of users often come problems to work on. From such problems, sometimes good ideas emerge.

However, I think these are often secondary effects. I will spend this chapter going through my career indicating where my ideas (both good and bad) came from. In a lot of cases, it was pure serendipity.

The Birth of Ingres

I arrived at Berkeley in 1971 as a new assistant professor. I knew my thesis was a bad idea or at best irrelevant. Berkeley hired me because I agreed to work on something called “urban systems.” This was around the time that the National Science Foundation (NSF) started a program called Research Applied to the National Needs (RANN), and Rand Corporation was making headlines applying Operations Research (OR) ideas to locating firehouses and allotting police officers. For a while I tried to work in this area: I studied the Berkeley Municipal Court system and then built a land-use model for Marin County, California. I learned quickly how difficult these studies were and how bogged down they became because of bad data.

At about this time Gene Wong suggested we take a look at databases. In short order, we decided the CODASYL proposal was incomprehensible, and IMS was far too restrictive. Ted Codd’s papers, of course, made perfect sense to us, and it was a no-brainer to start an implementation. We were not dissuaded by the fact that neither of us had ever written any software before nor managed a complex project. Several other research groups embarked on similar projects around the same time. Most (including us) got enough running so they could write a paper [Stonebraker et al. 1976b]. For totally unknown reasons, we persevered and got Ingres to work reasonably well, and it became widely used in the mid-1970s as the only RDBMS that researchers could get their hands on.1 In effect, Ingres made an impact mostly because we persevered and got a real system to work. I view this decision as pure serendipity.

Abstract Data Types (ADTs)

The main internal user of the Ingres prototype was an urban economics group led by Pravin Varaiya, which was interested in Geographic Information Systems (GIS). Angela Go implemented a GIS system on top of Ingres, but it didn’t work very well. Varaiya wanted polygon maps and operations like point-in-polygon and polygon-intersects-polygon. These are horrific to code in languages like QUEL and SQL and execute with dismal performance.
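
The pain is easy to appreciate from the point-in-polygon test itself. As a standalone function it is a few lines of ray casting, but forcing the same per-tuple logic through a relational query over a table of polygon edges is exactly what made these queries horrific. The sketch below is purely illustrative (invented example, not the Ingres ADT code):

```python
# Ray-casting point-in-polygon test: the kind of operation Varaiya's group
# needed. As an ADT function registered with the DBMS it is one call per
# tuple; expressed in QUEL or SQL over an edge table it is torturous.

def point_in_polygon(x, y, vertices):
    """Return True if (x, y) lies inside the polygon given by `vertices`."""
    inside = False
    n = len(vertices)
    for i in range(n):
        x1, y1 = vertices[i]
        x2, y2 = vertices[(i + 1) % n]
        # Count edges that straddle the horizontal ray from (x, y) rightward.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

square = [(0, 0), (1, 0), (1, 1), (0, 1)]
```

With an extensible type system, such a function becomes an operator usable directly in the query language, and the engine can evaluate it like any built-in.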

By this time, we understood how integers, floats, and strings worked in Ingres; however, simulating points, lines, and polygons on the Ingres type system was very painful. It seemed natural at the time to ask, “Why not extend the built-in types in Ingres?” We built such a system [Ong et al. 1984] into Ingres in 1982–1983, and it seemed to work very well. Therefore, we used the same idea in Postgres. In my opinion, this was the major innovation (a good idea) in Postgres, to which we turn next.2

Postgres3

Visitors to Berkeley always asked, “What is the biggest database that uses Ingres?” We always had to mumble, “Not very big at all.” The reason can be illustrated by an example. In 1978, Arizona State University seriously considered running Ingres for its student records system, which covered 40,000 students. The project team could get around the fact that they had to run an unsupported operating system (Unix) and an unsupported DBMS (Ingres), but the project faltered when ASU found that there was no COBOL available for Unix (they were a COBOL shop). For these reasons, essentially anybody serious would not consider Ingres, and it was relegated to modest applications. At about the same time, Larry Ellison started claiming that Oracle was ten times faster than Ingres, a strange claim given that Oracle didn’t even work yet.

It became obvious that, to make a difference, we had to move Ingres to a supported operating system, offer support, improve the documentation, implement a report writer, and so on. In short, we had to start a commercial company. I had no idea how to do this, so I went to talk to Jon Nakerud, then the Western Region sales manager for Cullinet Corp., which marketed IDMS (a CODASYL system). With Jon’s help as CEO, we raised venture capital and started what turned into Ingres Corp.

This was a “trial by fire” on how to start a company, and I certainly learned a lot. Very quickly, the commercial version of Ingres became far better than the academic version. Although we implemented abstract data types (ADTs) in the academic version, the handwriting was on the wall: it made no sense to continue prototyping on the academic version. It was time to start a new DBMS codeline, and Postgres was born.

In my opinion, Postgres [Stonebraker and Rowe 1986] had one very good idea (ADTs) and a bunch of forgettable ideas (inheritance, rules using an “always” command, and initial implementation in Lisp, to name a few). Commercialization (as Illustra Corporation) fixed a lot of the problems; however, ADTs were somewhat ahead of their time, and Illustra struggled to get real-world users to adopt them. As such, Illustra was sold to Informix in 1996.

The bright legacy of Postgres was purely serendipitous. Two Berkeley grad students, Wei Hong and Jolly Chen, converted the academic version of Postgres in 1995 from QUEL to SQL. Then a dedicated pickup team of volunteers, with no relationship to me or Berkeley, shepherded the codeline over time. That is the open source version of Postgres that you can download today off the Web from https://www.postgresql.org.

Distributed Ingres, Ingres*, Cohera, and Morpheus

My love affair with distributed databases spanned 25 years (from the mid-1980s to the late 2000s). It started with Distributed Ingres, which federated the academic Ingres codeline. This system assumed that the schemas at multiple locations were identical and that the data was perfectly clean and could be pasted together. The code sort of worked, and the main outcome was to convince me that the commercial Ingres codeline could be federated in the same way. This project turned into Ingres* in the mid-1980s. There were essentially zero users for either system.

Undaunted, we built another distributed database prototype, Mariposa, in the early 1990s, which based query execution on an economic model. In effect, Mariposa still assumed that the schemas were identical and the data was clean, but relaxed the Ingres* assumption that the multiple sites were in the same administrative domain. There was little interest in Mariposa, but a couple of the Mariposa students really wanted to start a company. Against my better judgment, Cohera was born and it proved yet again that there was no market for distributed databases.

Still undaunted, we built another prototype, Morpheus. By talking to real-world users, we realized that the schemas were never the same. Hence, Morpheus focused on translating one schema into another. However, we retained the distributed database model of performing the translation on the fly. Again, we started a company, Goby, which focused on integration of Web data, using essentially none of the Morpheus ideas. Goby was in the business-to-consumer (B2C) space, in other words, our customer was a consumer. In B2C, one has to attract “eyeballs” and success depends on word-of-mouth and buying Google keywords. Again, Goby was not a great success. However, it finally made me realize that federating databases is not a big concern to enterprises; rather, it’s performing data integration on independently constructed “silos” of data. Ultimately, this led to a prototype, Data Tamer [Stonebraker et al. 2013b], that actual users wanted to try.

In summary, I spent a lot of time on distributed/federated databases without realizing that there is no market for this class of products. Whether one will develop in the future remains to be seen. Not only did this consume a lot of cycles with nothing to show for it except a collection of academic papers, but it also made me totally miss a major multi-node market, which we turn to next.

Parallel Databases

I wrote a paper in 1979 proposing Muffin [Stonebraker 1979a], a shared-nothing parallel database. However, I did not pursue the idea further. A couple of years later, Gary Kelley, then an engineer at Sequent Computers, approached Ingres and suggested working together on a parallel database system. Ingres, which was working on Ingres* at the time, did not have the resources to pursue the project more than half-heartedly. Gary then went to Informix, where he built a very good parallel database system. All in all, I completely missed the important version of distributed databases where tightly coupled nodes have the same schema—namely parallel partitioned databases. This architecture enables much higher SQL performance, especially in the data warehouse marketplace.

Data Warehouses

Teradata pioneered commercial parallel database systems in the late 1980s with roughly the same architecture as the Gamma prototype built by Dave DeWitt [DeWitt et al. 1990]. In both cases, the idea was to add parallelism to the dominant single-node technology of the time, namely row stores. In the late 1990s and early 2000s, Martin Kersten proposed using a column store and started building MonetDB [Boncz et al. 2008]. I realized the importance of column stores in the data warehouse market through a consulting gig around 2002. When the gig ended, I started thinking seriously about C-Store [Stonebraker et al. 2005a], which turned into Vertica.4 This codeline supported parallel databases, with an LSM-style (Log-Structured Merge) storage infrastructure, a main-memory row store to assemble tuples to be loaded, a sort engine to turn the row store into compressed columns, and a collection of so-called projections to implement indexes. Most of my other startups rewrote everything to fix the stuff that I got wrong the first time around. However, C-Store pretty much got it right, and I feel very proud of the Vertica codeline. To this day, it is nearly unbeatable in bakeoffs.
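
The write path described here (buffer incoming tuples in a main-memory row store, then sort them and rewrite each attribute as a compressed column) can be caricatured in a few lines. This is a toy sketch with invented names, not the C-Store or Vertica implementation:

```python
# Toy C-Store-style write path: rows accumulate in a main-memory buffer; a
# merge step sorts them on the sort key and rewrites each attribute as a
# run-length-encoded column. All names are invented for illustration.

def rle_encode(values):
    """Run-length encode a column as (value, run_length) pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1] = (v, runs[-1][1] + 1)
        else:
            runs.append((v, 1))
    return runs

class ToyColumnStore:
    def __init__(self, sort_key):
        self.sort_key = sort_key   # attribute position to sort on
        self.row_buffer = []       # write-optimized store (whole rows)
        self.columns = []          # read-optimized store (RLE columns)

    def insert(self, row):
        self.row_buffer.append(row)

    def merge(self):
        """Sort buffered rows and rewrite them as compressed columns."""
        rows = sorted(self.row_buffer, key=lambda r: r[self.sort_key])
        self.columns = [rle_encode(col) for col in zip(*rows)]
        self.row_buffer = []

store = ToyColumnStore(sort_key=0)
for row in [("MA", 2005), ("CA", 2003), ("MA", 2004), ("CA", 2005)]:
    store.insert(row)
store.merge()
```

Sorting on the leading attribute is what makes run-length encoding pay off: equal values become long runs, which is why a column store sorted on the right projection can be so compact.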

In summary, it was serendipitous to get the consulting gig, which got me to understand the true nature of the data warehouse performance problem. Without that, it is doubtful I would have ever worked in this area.

H-Store/VoltDB

David Skok, a VC with Matrix Partners, suggested one day that it would be great to work on a new OLTP architecture, different from the disk-based row stores of the time. C-Store had convinced me that “one size does not fit all” [Stonebraker and Çetintemel 2005], so I was open to the general idea. Also, it was clear that the database buffer pool chews up a lot of cycles in a typical DBMS. When Dave DeWitt visited MIT, we instrumented his prototype, Shore [Carey et al. 1994], to see exactly where all the cycles went. This generated the “OLTP: Through the Looking Glass” paper [Harizopoulos et al. 2008]; all of us were shocked that multi-threading, the buffer pool, concurrency control, and the log consumed an overwhelming fraction of the CPU cycles in OLTP. This, coupled with the increasing size of main memory, led to H-Store and then to a company, VoltDB.5

The offhand remark from David Skok certainly stuck in my memory and caused me to look seriously a year or two later. Certainly, there was serendipity involved.

Data Tamer

Joey Hellerstein decided to visit Harvard in 2010–2011, and we agreed to brainstorm about possible research. This quickly evolved into a data integration project called Data Tamer. In a previous company, Goby, we had been trying to integrate the contents of 80,000 websites and had struggled with a custom code solution. Goby was willing to supply its raw data, that is, their crawl of the 80,000 sites. We decided to work on their data, and the project quickly evolved to trying to do scalable data integration. Goby data needed schema integration, data cleaning, and entity consolidation, which we started addressing. Talks on our project brought us two additional enterprise clients, Verisk Health and Novartis. Both were focused primarily on entity consolidation.
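
Entity consolidation, the problem both clients cared about, amounts to clustering records that refer to the same real-world entity. A deliberately crude sketch conveys the shape of the task (a fixed normalization rule on invented data, far simpler than what Data Tamer actually did):

```python
# Crude entity-consolidation sketch: group records whose normalized names
# agree. Real systems like Data Tamer use much richer matching than this
# fixed rule; the data and rule here are invented for illustration.
import re
from collections import defaultdict

def normalize(name):
    """Lowercase, strip punctuation, and drop common corporate suffixes."""
    name = re.sub(r"[^a-z0-9 ]", "", name.lower())
    words = [w for w in name.split() if w not in {"inc", "corp", "co", "llc"}]
    return " ".join(words)

def consolidate(records):
    """Cluster records by their normalized form."""
    clusters = defaultdict(list)
    for rec in records:
        clusters[normalize(rec)].append(rec)
    return dict(clusters)

sources = ["Acme Corp.", "ACME, Inc.", "acme", "Widget Co"]
clusters = consolidate(sources)
```

The hard part in practice is that no fixed rule survives contact with 80,000 websites, which is why scalable data integration became a research problem in its own right.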

At about this time, MIT was setting up a relationship with the Qatar Computing Research Institute (QCRI), and Data Tamer became an early project in this collaboration. In effect, Data Tamer was focused on solving the data integration problems presented by Goby, Verisk Health, and Novartis. In a way, I think this is an ideal startup template: find some clients with a problem and then try to solve it.6

Again, there was much serendipity involved. Data Tamer would not have happened without Joey visiting Harvard and without Goby being willing to provide its data.

How to Exploit Ideas

In every case, we built a prototype to demonstrate the idea. In the early days (Ingres/Postgres), these were full-function systems; in later days (C-Store/H-Store), the prototypes cut a lot of corners. In the early days, grad students were happy to write a lot of code; these days, big implementations are dangerous to the publication health of grad students. In other words, the publication requirements of getting a good job are contrary to getting full-function prototypes to work!

In both cases, happy grad students are a requirement for success. I have only two points to make in this direction. First, I view it as my job to make sure that a grad student is successful, in other words, gets a good position following grad school. Hence, I view it as my job to spend as much time as necessary helping students be successful. To me, this means feeding good ideas to students to work on until they can get the hang of generating their own. In contrast, some professors believe in letting their students flounder until they get the hang of research. In addition, I believe in treating students fairly, spending as much time with them as necessary, teaching them how to write technical papers, and, in general, doing whatever it takes to make them successful. In general, this philosophy has produced energetic, successful, and happy students who have gone on to do great things.

Closing Observations

Good ideas are invariably simple; it is possible to explain the idea to a fellow researcher in a few sentences. In other words, good ideas seem to always have a simple “elevator pitch.” It is wise to keep in mind the KISS adage: “Keep it Simple, Stupid.” The landscape is littered with unbuildable ideas. In addition, never try to “boil the ocean.” Make sure your prototypes are ultra-focused. Lastly, good ideas come whenever they come. Don’t despair if you don’t have a good idea today. This adage is especially true for me: At 74 years old, I keep worrying that I have “lost it.” However, I still seem to get good ideas from time to time …

1. See Ingres’ impact in Chapter 13.

2. The use of Abstract Data Types, considered one of my most important contributions, is discussed in Chapters 3, 12, 15, and 16.

3. See Postgres’ impact in Chapter 13.

4. For the Vertica story, see Chapter 18.

5. See Chapter 19 for the VoltDB story.

6. See Chapter 7 (Stonebraker) for a detailed description.

11

Where We Have Failed

Michael Stonebraker

In this chapter, I suggest three areas where we have failed as a research community. In each case, I indicate the various ramifications of these failures, which are substantial, and propose mitigations. In all, these failures make me concerned about the future of our field.

The Three Failures

Failure #1: We Have Failed to Cope with an Expanding Field

Our community spearheaded the elevation of data sublanguages from the record-at-a-time languages of the IMS and CODASYL days (1970s) to the set-at-a-time relational languages of today. This change, which occurred mostly in the 1980s along with the near-universal adoption of the relational model, allowed our community to investigate query optimizers, execution engines, integrity constraints, security, views, parallelism, multi-mode capabilities, and the myriad of other capabilities that are features of modern DBMSs.

SQL was adopted as the de facto DBMS interface nearly 35 years ago with the introduction of DB2 in 1984. At the time, the scope of the DBMS field was essentially business data processing. Pictorially, the state of affairs circa 1985 is shown in Figure 11.1, along with the expansion of scope in the 1980s. The net result was a research community focused on a common set of problems for a business data processing customer. In effect, our field was unified in its search for better data management for business data processing customers.

Since then, to our great benefit, the scope of DBMSs has expanded dramatically as nearly everyone has realized they need data management capabilities. Figure 11.2 indicates our universe 30 years later. Although there is some activity in business data processing (data warehouses, OLTP), a lot of the focus has shifted to a variety of other application areas. Figure 11.2 lists two of them: machine learning and scientific databases. In these areas, the relational data model is not popular, and SQL is considered irrelevant. The figure lists some of the tools researchers are focused on. Note clearly that the important topics and tools in the various areas are quite different. In addition, there is minimal intersection of these research thrusts.

Figure 11.1  Our universe circa 1985: business data processing.

Figure 11.2  Our universe now.

As a result of the expanding scope of our field, the cohesiveness of the 1980s is gone, replaced by sub-communities focused on very different problems. In effect, our field is now composed of a collection of subgroups, which investigate separate topics and optimize application-specific features. Connecting these domain-specific features to persistent storage is done with domain-specific solutions. In effect, we have decomposed into a collection of N domain-specific groups with little interaction between them.

One might hope that these separate domains might be unified through some higher-level encompassing query notation. In other words, we might hope for a higher level of abstraction that would reunify the field. In business data processing, there have been efforts to develop higher-level notations, whether logic-based (Prolog, Datalog) or programming-based (Ruby on Rails, LINQ). None have caught on in the marketplace.

Unless we can find ways to generate a higher-level interface (which I consider unlikely at this point), we will effectively remain a loose coalition of groups with little research focus in common.

This state of affairs can be described as “the hollow middle.” I did a quick survey of the percentage of ACM SIGMOD papers that deal with our field as it was defined in 1977 (storage structures, query processing, security, integrity, query languages, data transformation, data integration, and database design). Today, I would call this the core of our field. Here is the result:

1977  100% (21/21)
1987   93% (40/43)
1998   68% (30/44)
2008   52% (42/80)
2017   47% (42/90)

I include only research papers, not papers from the industrial or demo tracks (once those tracks came into existence). Notice that the core is being “hollowed out,” as our researchers drift into working on what would have been called applications 40 years ago. In my opinion, the reason for this shift is that the historical uses of DBMS technology (OLTP, business data warehouses) are fairly mature. As a result, researchers have largely moved on to other challenges.
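
The percentages quoted above follow directly from the raw paper counts; as a quick sanity check, a few lines of Python reproduce the figures (note that Python's default round-half-to-even yields the quoted 52% for 2008's 42/80 = 52.5):

```python
# Core-topic share of SIGMOD research papers, from the counts in the survey above.
# Each tuple: (year, papers on core topics, total research papers).
survey = [(1977, 21, 21), (1987, 40, 43), (1998, 30, 44),
          (2008, 42, 80), (2017, 42, 90)]

for year, core, total in survey:
    pct = round(100 * core / total)  # round-half-to-even: 52.5 -> 52
    print(f"{year}: {pct}% ({core}/{total})")
```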

However, the fact remains that there is little or no commonality across the various application areas. What is important in Natural Language Processing (NLP) is totally different from what is important in machine learning (ML) or scientific data processing. The net effect is that we have essentially “multi-furcated” into subgroups that don’t communicate with each other. This is reminiscent of the 1982 bifurcation that occurred when ACM SIGMOD and ACM PODS split.

My chief complaint is that we have failed to realize this, and our conference structures (mostly minor tweaks on 30-year-old ideas) are not particularly appropriate for the current times. In my opinion, the best solution is to recognize the hollow middle, and decompose the major DBMS conferences (SIGMOD, VLDB, ICDE) into multiple (say five or six) separate tracks with independent program committees. These can be co-located or organized separately. In other words, multi-furcate along the lines of the SIGMOD/PODS division many years ago.

If this does not happen, then the current conferences will remain “zoos” where it is difficult to impossible to find like-minded researchers. Also, reviewing will be chaotic (as discussed below), which frustrates researchers. The “systems folks” seem particularly upset by the current state of affairs. They are on the verge of declaring a divorce and starting their own conference. Other subgroups may follow. The result will be a collapse of the field as we know it.

You might ask, “Where is there room for cross-cultural research?” I have not seen people from other fields at SIGMOD conferences in quite a while. Equally, program committees do not have such multi-culturalism. The obvious answer is to have more cross-cultural conferences. In olden times, we used to have such things, but they have fallen out of favor in recent years.

Failure #2: We Have Forgotten Who Our Customer Is

Forty years ago, there was a cadre of “industry types” who came to our conferences. They were early users (evangelists) of DBMS technology and came from financial services, insurance, petroleum exploration, and so on. As such, they provided a handy reality check on whether any given idea had relevance in the real world. In effect, they were surrogates for our “customer.” Hence, our mission was to provide better DBMS support for the broad customer base of DBMS technology, represented by the evangelists.

Over the years, these evangelists have largely disappeared from our conferences. As such, there is no customer-facing influence for our field. Instead, there are representatives of the large Internet vendors—I call them “the whales”—who have their own priorities and represent the largest 0.01% of DBMS users. Hence, they hardly represent the real world. In effect, our customer has vanished and been replaced by either the whales or a vacuum.

This loss of our customer has resulted in a collection of bad effects. First, there are no “real world” clients to keep us focused. As such, we are prone to think the next “silver marketing bullet” from the whales is actually a good idea. Our community has embraced and then rejected (when it became apparent that the idea was terrible) OLE DB, MapReduce, the Semantic Web, object databases, XML, and data lakes, just to name a few.

We are very uncritical of systems created by the large Internet vendors to solve application-specific problems, which until recently have been written by development teams with little background in DBMSs. As such, they have tended to reinvent the wheel. I am especially amused by Google’s adoption and then rejection of MapReduce and eventual consistency.

Our community needs to become more assertive at pointing out flawed ideas and “reinventions of the wheel.” Otherwise, the mantra “people who do not understand history will be condemned to repeat it” will continue to be true.

We desperately need to reconnect with the “real world.” This could be done by giving free conference registrations to real-world users, organizing panels of real users, inviting short problem commentaries from real users, etc. I am also amused at the number of attendees at our conferences who have no practical experience in applying DBMS technology to real problems. Our field exists to serve a customer. If the customer is “us,” then we have totally lost our way.

Failure #3: We Have Not Addressed the Paper Deluge

When I graduated from Michigan in 1971 with a Ph.D., my resume consisted of zero papers. Five years later, I was granted tenure at Berkeley with a resume of half a dozen papers. Others from my age cohort (e.g., Dave DeWitt) report similar numbers. Today, to get a decent job with a fresh Ph.D., one needs around 10 papers; to get tenure, the goal is more like 40. It is becoming common to take a two-year postdoc to build up one’s resume before hitting the academic job market. Another common tactic these days is to accept an academic job and then delay starting the tenure clock by taking a postdoc for a year. This was done recently, for example, by Peter Bailis (now at Stanford) and Leilani Battle (now at Maryland). The objective in both situations is to get a head start on the paper deluge required for tenure.

Put differently, there has been an order of magnitude escalation in desired paper production. Add to this fact that there are (say) an order of magnitude more DBMS researchers today than 40 years ago, and we get paper output rising by two orders of magnitude. There is no possible way to cope with this deluge. There are several implications of this escalation.

First, the only way that I would ever read a paper is if somebody else said it was very good or if it was written by one of a handful of researchers that I routinely follow. As a result, we are becoming a “word of mouth” distribution system. That makes it nearly impossible for somebody from the hinterlands to get well known. In other words, you either work with somebody from the “in crowd” or you are in “Outer Siberia.” This makes for an un-level tenure-track playing field.

Second, everybody divides their ideas into Least Publishable Units (LPUs) to generate the kind of volume that a tenure case requires. Generally, this means there is no seminal paper on a particular idea, just a bunch of LPUs. This makes it difficult to follow the important ideas in the field. It also ups the number of papers researchers must read, which makes us all grumpy.

Third, few graduate students are willing to undertake significant implementation projects. If you have to write ten papers in (say) four productive years, that is a paper every five months. You cannot afford the time for significant implementations. This results in graduate students being focused on “quickies” and tilts paper production toward theory papers. More on this later.

So how did this paper explosion occur? It is driven by a collection of universities, mostly in the Far East, whose deans and department chairpeople are too lazy to actually evaluate the contribution of a particular researcher. Instead they just count papers as a surrogate. This laziness also appears to exist at some second- and third-rate U.S. and European universities.

Our failure to deal with the deluge will allow this phenomenon to get worse off into the future. Hence, we should actively put a stop to it, and here is a simple idea. It would be fairly straightforward to get the department chairpeople of (say) the 25 best U.S. universities to adopt the following principle:

Any DBMS applicant for an Assistant Professor position would be required to submit a resume with at most three papers on it. Anybody coming up for tenure could submit a resume with at most ten papers. If an applicant submitted a longer resume, it would be sent back to the applicant for pruning. Within a few years, this would radically change publication patterns. Who knows, it might even spread to other disciplines in computer science.

Consequences of Our Three Failures

Consequence #1: Reviewing Stinks

A consequence of the paper deluge and the “hollow middle” is the quality of reviewing, which, in general, stinks. In my experience, about half of the comments from reviewers are way off the mark. Moreover, the variance of reviews is very high. Of course, the problem is that a program committee has about 200 members, so it is a hit-or-miss affair. The biases and predispositions of the various members just increase the variance. Given the “hollow middle,” the chances of getting three reviewers who are experts in the application domain of any given paper are low, thereby augmenting the variance. Add to this the paper deluge and you get very high volume and low reviewing quality.

So, what happens? The average paper is rewritten and resubmitted multiple times. Eventually, it generally gets accepted somewhere. After all, researchers have to publish or perish!

In ancient times, the program chairperson read all the papers and exerted at least some uniformity on the reviewing process. Moreover, there were face-to-face program committee meetings where differences were hashed out in front of a collection of peers. This is long gone—overrun by the size (some 800 papers) of the reviewing problem. In my opinion, the olden times strategy produced far better results.

The obvious helpful step would be to decompose the major DBMS conferences into subconferences as noted in Failure #1. Such subconferences would have (say) 75 papers. This would allow “old school” reviewing and program committees. This subdivision could be adopted easily by putting the current “area chairperson” concept on steroids. These subconferences could be co-located or not; there are pros and cons to both possibilities.

Another step would be to dramatically change the way paper selection is done. For example, we could simply accept all papers, and make reviews public, along with reviewers’ scores. A researcher could then put on his or her resume his or her paper and the composite score he or she received. Papers would get exposure at a conference (long slot, short slot, poster) based on the scores. However, the best solution would be to solve Failure #3 (the paper deluge).

If the status quo persists, variance will just increase, resulting in more and more overhead for poorer and poorer results.

Consequence #2: We Have Lost our Research Taste

Faced with required paper production, our field has drifted into solving artificial problems, and especially into making 10% improvements on previous work (Least Publishable Units). The number of papers at major DBMS conferences that seem completely irrelevant to anything real seems to be increasing over time. Of course, the argument is that it is impossible to decide whether a paper will have impact at some later point in time. However, the number of papers that make a 10% improvement over previous work seems very large. A complex algorithm that makes a 10% improvement over an existing, simpler one is just not worth doing. Authors of such papers are just exhibiting poor engineering taste. I have generally felt that we were polishing a round ball for about the last decade. I would posit the following question: “What was the last paper that made a dramatic contribution to our field?” If you said a paper written in the last ten years, I would like to know what it is.

A possible strategy would be to require every assistant professor to spend a year in industry, pre-tenure. Nothing generates a reality check better than some time spent in the real world. Of course, implementing this tactic would first require a solution to Failure #3. In the current paper climate, it is foolhardy to spend a year not grinding out papers.

Consequence #3: Irrelevant Theory Is Taking Over

Given that our customer has vanished and given the required paper production, the obvious strategy is to grind out “quickies.” The obvious way to optimize quickies is to include a lot of theory, whether relevant to the problem at hand or not. This has two major benefits. First, it makes for quicker papers, and therefore more volume. Second, it is difficult to get a major conference paper accepted without theorems, lemmas, and proofs. Hence, this optimizes acceptance probability.

This focus on theory, relevant or not, effectively guarantees that no big ideas will ever be presented at our conferences. It also guarantees that no ideas will ever be accepted until they have been polished to be pretty round.

My personal experience is that experimental papers are difficult to get by major conference reviewers, mostly because they have no theory. Once we move to polishing a round ball, then the quality of papers is not measured by the quality of the ideas, but by the quality of the theory. To put a moniker on this effect, I call this “excessive formalism,” which is lemmas and proofs that do nothing to enhance a paper except to give it theoretical standing. Such “irrelevant theory” essentially guarantees that conference papers will diverge from reality. Effectively, we are moving to papers whose justification has little to nothing to do with solving a real-world problem. Because of this attitude, our community has moved from serving a customer (some real-world person with a problem) to serving ourselves (with interesting math). Of course, this is tied to Failure #3: Getting tenure is optimized by “quickies.” I have nothing against theoretical papers, just a problem with irrelevant theory.

Of course, our researchers will assert that it is too difficult to get access to real-world problems. In effect, the community has been rendered sterile by the refusal of real enterprises to partner with us in a deep way. The likes of Google, Amazon, or Microsoft also refuse to share data with our community. In addition, my efforts to obtain data on software faults from a large investment bank were stymied because the bank did not want to make its DBMS vendor look bad, given the frequency of crashes. I have also been refused access to software evolution code by several large organizations, which apparently have decided that their coding techniques, their code, or both were proprietary (or perhaps may not stand up to scrutiny).

As a result, we deal primarily with artificial benchmarks (such as YCSB¹) or benchmarks far outside the mainstream (such as Wikipedia). I am particularly reminded of a thread of research that artificially introduced faults into a dataset and then proved that the algorithms being presented could find the faults they injected. In my opinion, this proves absolutely nothing about the real world.

Until (and unless) the community finds a way to solve Failure #2 and to engage real enterprises in order to get real data on real problems, then we will live in the current theory warp. The wall between real enterprises and the research community will have to come down!

Consequence #4: We Are Ignoring the Hardest Problems

A big problem facing most enterprises is the integration of disparate data sources (data silos). Every large enterprise divides into semi-independent business units to enable business agility. However, this creates independently constructed “data silos.” It is clearly recognized that data silo integration is hugely valuable, for cross-selling, social networking, a single view of a customer, etc. But data silo integration is the Achilles’ heel of data management, and there is ample evidence of this fact. Data scientists routinely say that they spend at least 80% of their time on data integration, leaving at most 20% for the tasks for which they were hired. Many enterprises report that data integration (data curation) is their most difficult problem.

So, what is our community doing? There was some work on data integration in the 1980s as well as work on federated databases over the last 30 years. However, federating datasets is of no value unless they can be cleaned, transformed, and deduplicated. In my opinion, insufficient effort has been directed at this problem or at data cleaning, which is equally difficult.

How can we claim to have the research mandate of management of data if we are ignoring the most important management problem? We have become a community that looks for problems with a clean theoretical foundation that beget mathematical solutions, not one that tries to solve important real-world problems. Obviously, this attitude will drive us toward long-term irrelevance.

Of course, this is an obvious result of the necessity of publishing mountains of papers, in other words, don’t work on anything hard, whose outcome is not guaranteed to produce a paper. It is equally depressing that getting tenure does not stop this paper grind, because your students still need to churn out the required number of papers to get a job. I would advise everybody to take a sabbatical year in industry and delve into data quality or data integration issues. Of course, this is a hollow suggestion, given the current publication requirements on faculty.

Data integration is not the only incredibly important issue facing our customers. Evolution of schemas as business conditions change (database design) is horribly broken, and real customers don’t follow our traditional wisdom. It is also widely reported that new DBMS products require some $20M in capital to get to production readiness, as they did 20 years ago. For a mature discipline this is appalling. Database applications still require a user to understand way too much about DBMS internals to effectively perform optimization.

In other words, there is no shortage of very important stuff to work on. However, it often does not make for good theory or quickies and often requires massaging a lot of ugly data that is hard to come by. As a community, we need to reset our priorities!

Summary

I look out at our field with its hollow middle and increasing emphasis on applications with little commonality to each other. Restructuring our publication system seems desperately needed. In addition, there is increasing pressure to be risk averse and theoretical, so as to grind out the required number of publications. This is an environment of incrementalism, not one that will enable breakthrough research. In my opinion, we are headed in a bad direction.

Most of my fears are rectifiable, given enlightened leadership by the elders of our community. The paper deluge is addressable in a variety of ways, some of which were noted above. The hollow middle is best addressed in my opinion by multi-furcating our conferences. Recruiting the real world, first and foremost, demands demonstrating that we’re relevant to their needs (working on the right problems). Secondarily, it is a recruitment problem, which is easily addressed with some elbow grease.

This chapter is a plea for action, and quickly! If there is no action, I strongly suspect the systems folks will secede, which will not be good for the unity of the field. In my opinion, the “five-year assessment of our field,” which is scheduled for the Fall of 2018 and organized by Magda Balazinska and Surajit Chaudhuri, should focus primarily on the issues in this chapter.

1. Yahoo Cloud Serving Benchmark (https://research.yahoo.com/news/yahoo-cloud-serving-benchmark/). Last accessed March 2, 2018.

12

Stonebraker and Open Source

Mike Olson

The Origins of the BSD License

In 1977, Professor Bob Fabry at Berkeley began working with a graduate student, Bill Joy, on operating systems. Fabry and his colleagues created the Computer Systems Research Group (CSRG) to explore ideas like virtual memory and networking. Researchers at Bell Labs had created an operating system called UNIX™ that could serve as a good platform for testing out their ideas. The source code for UNIX was proprietary to the Labs’ parent, AT&T, which carefully controlled access.

Because it wanted to enhance UNIX, CSRG depended on the Bell Labs source code. The group wanted to share its work with collaborators, so would somehow have to publish any new code it created. Anyone who got a copy of the Berkeley software needed to purchase a source code license from AT&T as well.

CSRG worked with the university’s intellectual property licensing office to craft the “Berkeley Software Distribution (BSD)” license. This license allowed anyone to receive, modify, and further share the code they got from CSRG, encouraging collaboration. It placed no additional restrictions on the AT&T source code—that could continue to be covered by the terms of the AT&T source code license.

The BSD license was a really nifty hack. It protected the interests of AT&T, maintaining the good relationship Berkeley had with Bell Labs. It allowed the Berkeley researchers to share their innovative work broadly, and to take back contributions from others. And, significantly, it gave everyone a way to work together to build on the ideas in UNIX, making it a much better system.

BSD and Ingres

In 1976, before CSRG began working on UNIX, Mike Stonebraker had launched a research project with Eugene Wong (and, later, Larry Rowe) to test out some ideas published by Ted Codd [Codd 1970] and Chris Date. Codd and Date developed a “relational model,” a way to think about database systems that separated the physical layout and organization of data on computers from operations on them. You could, they argued, describe your data, and then say what you wanted to do with it. Computers could sort out all the plumbing, saving people a lot of trouble and time.

The new project was called Ingres, short for INteractive Graphics REtrieval System.

Mike and his collaborators began to share their code with other institutions before CSRG finished its work on the BSD license. The earliest copies were shipped on magnetic tape with essentially no oversight by the university; the recipient would cover the cost of the tapes and shipping, and a grad student would mail a box of source code. There was no explicit copyright or licensing language attached.

The intellectual property office at Berkeley soon learned of this and insisted that Ingres adopt a new practice. Bob Epstein, a leader on the project at the time, sent a regretful email to the Ingres mailing list explaining that the software now carried a UC Berkeley copyright, and that further sharing or distribution of the software required written permission of the university. The Ingres team was disappointed with the change: they wanted widespread adoption and collaboration, and the new legal language interfered with both.

Notwithstanding that limitation, Ingres thrived for several years as a purely academic project. The research team implemented ideas, shipped code to collaborators, and got useful feedback. By 1980, the project had matured enough to be a credible platform for real query workloads. Companies that had been using older database systems were getting interested in Ingres as a possible alternative.

Mike and several of his colleagues decided to capitalize on the opportunity and created Relational Technology, Inc. (RTI) to commercialize the research they had done. The software, unfortunately, was under the restrictive UC Berkeley copyright, which required written permission by the university to reproduce or distribute. Mike made an audacious decision: he unilaterally declared the software to be in the public domain. RTI picked up the research code and used it as the foundation for its commercial offering.

In retrospect, it is hard to believe that it worked. Young professors do not often contradict the legal departments of their employers, especially for their own financial benefit. Mike has no clear explanation himself for how he got away with it. Most likely, quite simply, no one noticed.

Very soon afterward, the CSRG team finished its work with the university’s legal department and published the BSD license. Once it existed, Mike quickly adopted it for the Ingres project source code as well. It satisfied all their goals—freely available to use, extend and enhance, and share further; no discrimination against commercial use or redistribution. Best of all, CSRG had already gotten the Berkeley lawyers to agree to the language, so there was no reasonable objection to its use for Ingres. This forestalled any challenge to Mike’s brief public-domain insurrection.

Nobody remembers exactly when the first Ingres source code tapes went out under the BSD license, but it was an important day. It made Ingres the world’s first complete, standalone substantial piece of systems software distributed as open source. CSRG was shipping BSD code, but it needed the AT&T-licensed code to build on; Ingres compiled and ran without recourse to third-party, proprietary components.

The Impact of Ingres

The Ingres project helped to create the relational database industry, which provided a foundation for all sorts of other technological innovations. Along with System R at IBM, Ingres turned the theory that Codd and Date espoused into practice. The history and detail of the Ingres project are available in Chapter 15, “The Ingres Years.”

My own work was on the Postgres project (see Chapter 16 for more information). Like Ingres, Postgres used the BSD license for source code distributions. Postgres was a reaction to the success of Ingres in three important ways.

First, Ingres was a remarkably successful research project and open source database. By the mid- to late-1980s, however, it was clear that the interesting questions had pretty much been asked and answered. Researchers aim to do original work on tough problems. There just wasn’t a lot more Ph.D.-worthy work to be done in Ingres.

Second, Mike had started RTI to bring the research project to the commercial market. The company and the research project coexisted for a while, but that could not continue forever. Neither the university nor the National Science Foundation was likely to fund development for the company. Mike had to separate his research from his commercial interests. The Ingres project had to end, and that meant Mike needed something new to work on.

Finally, and more fundamentally, Ingres had constrained Codd and Date’s relational theory in important ways. The project team chose the data types and the operations most interesting to them, and those easiest to implement in the C programming language on UNIX. Together, Ingres and IBM’s System R had served as reference implementations for all the relational database vendors that cropped up in the 1980s and 1990s (Chapter 13). They mostly chose the same datatypes, operations, and languages that those two research projects had implemented.

Post-Ingres

Mike argued that the relational model supported more than just those early data types and operations, and that products could do more. What if you could store graphs and maps, not just integers and dates? What if you could ask questions about “nearby” not just “less than?” He argued that the software itself could be smarter and more active. What if you could see not just current information, but the whole history of a database? What if you could define rules about data in tables, and the software enforced them?

Features like that are commonplace in database products today, but in the middle 1980s none of the commercial vendors were thinking about them. Mike created the Postgres project—for “post-Ingres,” because college professors aren’t always great at brand names—to explore those and other ideas.

As a research vehicle and as an engine for graduate degrees, Postgres was phenomenally successful. Like any real research project, it tried out some things that failed: The “no-overwrite” storage system and “time travel” [Stonebraker et al. 1990b, Stonebraker and Kemnitz 1991, Stonebraker 1987] were interesting, but never found a commercial application that pulled them into widespread use in industry. Other ideas did take hold, but not always in the way that Postgres did them: Mike’s students implemented rules [Potamianos and Stonebraker 1996, Stonebraker et al. 1988a, Stonebraker et al. 1989] that applied to tables in a couple of different ways (the “query rewrite” rules system and the “tuple-level” rules system). Most databases support rules today, but they’re built on the foundations of those systems, and not often in the ways that Postgres tried out. Some ideas, however, did get widespread adoption, much in the way that Postgres designed them. Abstract data types (ADTs) and user-defined functions (UDFs) in database systems today are based expressly on the Postgres architecture; support for spatial data and other complex types is commonplace.
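
As an illustration of the extensibility idea that Postgres pioneered, the sketch below registers a user-defined function with an embedded database so that SQL can ask a "nearby" question rather than just "less than." This is a hypothetical stand-in: it uses Python's built-in SQLite bindings (not the Postgres ADT machinery itself), and the table, data, and function names are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

def euclidean_distance(x1, y1, x2, y2):
    """A toy 'spatial' operation: straight-line distance between two points."""
    return ((x1 - x2) ** 2 + (y1 - y2) ** 2) ** 0.5

# Register the Python function so SQL queries can call it by name.
conn.create_function("distance", 4, euclidean_distance)

conn.execute("CREATE TABLE cities (name TEXT, x REAL, y REAL)")
conn.executemany("INSERT INTO cities VALUES (?, ?, ?)",
                 [("A", 0.0, 0.0), ("B", 3.0, 4.0), ("C", 6.0, 8.0)])

# A 'nearby' question, not just 'less than': cities within distance 5 of origin.
rows = conn.execute(
    "SELECT name FROM cities WHERE distance(x, y, 0, 0) <= 5 ORDER BY name"
).fetchall()
print([r[0] for r in rows])  # → ['A', 'B']
```

The point is the architecture, not the engine: once the query processor can call out to user code inside a predicate, support for spatial and other complex types follows naturally.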

In the early 1990s, the Ingres cycle repeated itself with Postgres. Most of the fundamental new research ideas had been explored. The project had been successful enough to believe that there would be commercial demand for the ideas. Mike started Montage (soon renamed Miro, then renamed again to Illustra) in 1992 to build a product based on the project, and moved his research focus elsewhere. Illustra was acquired by Informix in 1996, and the bigger company integrated many of Postgres’ features into its Universal Database product. Informix was, in turn, later acquired by IBM.

The Impact of Open Source on Research

If you ask Mike today, he will tell you that the decision to use the BSD license for Ingres was just dumb luck. He was there at the time, so we should take him at his word. Whether or not he was an open-source visionary in the late 1970s, however, it’s clear that he was an important figure in the early open-source movement. Ingres’ success as open source changed the way that research is done. It also helped to shape the technology industry. Mike now endorses using open source both in research (Chapter 10) and in startups (Chapter 7).

Just like the Ingres team did, those of us working on Postgres had an advantage that most graduate students lacked: we had real users. We’d publish a build on the project FTP site (no World Wide Web back then, youngsters!) and we’d watch people download it from all over the planet. We didn’t offer any formal support, but we had a mailing list people could send questions and problems to, and we’d try to answer them. I still remember the blend of pride and angst we felt when we got a bug report from a Russian nuclear power facility—shouldn’t they have been running code that had a quality assurance team?

The decision to publish research as open source established Berkeley as the first-class systems school it still is today. Grad students at other schools wrote papers. We shipped software. Oh, sure, we wrote papers, too, but ours were improved tremendously because we shipped that software. We could collaborate easily with colleagues around the globe. We learned how our ideas worked not just in theory, but also in practice, in the real world.

Code is a fantastic vector for the scientific method. Code makes it incredibly easy for others to test your hypothesis and reproduce your results. Code is easy for collaborators to enhance and extend, building on your original ideas with new ones of their own.

Ingres’ success as the first large open-source systems software project influenced the thinking of faculty at universities around the world, but especially at Berkeley. Virtually all systems work at Berkeley today is open source. It’s as fundamental to the university as free speech and Top Dog.1

Stonebraker himself learned the lesson, of course. He had seen the benefits of the BSD license with Ingres and used it again on Postgres. Open source gave Postgres a tremendous impact on the industry.

For example, unlike Ingres, the project survived its shutdown by the university. Two former students, Jolly Chen and Andrew Yu, launched a personal project to replace Postgres’ “postquel” query language with the by-then-industry-standard SQL. They rewrote the query parser and put up a new project page. In a nod to history and to the hard work they’d done, they named the new package “PostgreSQL.”

Their work attracted the attention of folks outside the university. Today, PostgreSQL has a vibrant developer and user community around the world. PostgreSQL remains a proudly independent project, deployed widely, creating value and opportunity for an ecosystem of contributors, users, support organizations, consultants, and others. At the time I am writing this chapter, the project’s development hub shows 44,000 commits against more than one million lines of code. At least two standalone companies ship commercial versions of it today. Besides that, virtually every version of Linux bundles a copy, and Amazon offers a hosted version for use in the cloud. All of that is just a sampling. There’s a whole lot of PostgreSQL out there.

Getting taxpayer-funded research out of the university so that it can benefit citizens is important. Open source makes that easier. Mike, with Ingres, was the first person to create a substantial and innovative piece of open-source systems software with government funding, and then to start a company to commercialize precisely the IP created inside the university. The model worked well. He repeated it with Postgres and many other projects at Berkeley and other universities since.

That showed professors at Berkeley and elsewhere that they could work in the academy and still participate in the marketplace. Professors and grad students have long taken risks in research, and then started companies to bring products to market. Open source eliminates friction: the code is there, all set to go, permission granted. Whether we attract better young people to careers in research because of this is unknowable; certainly, we give those who choose to advance the state of the art a way to participate in the value that they create.

And it is not only Mike and colleagues like him who benefit financially.

Because of the permissive BSD license, Postgres and PostgreSQL were available to anyone who wanted to start a company based on the project. Many did. Netezza, Greenplum, Aster Data, and others adopted and adapted the code. Pieces of it—the query parser, for example—have found their way into other products. That saved many millions in upfront research and development costs. It made companies possible that might never have started otherwise. Customers, employees, and the investors of all those companies benefited tremendously.

My own career owes a great deal to the innovation in software licensing and the relational database industry that descended from the Ingres project. I have been part of two startups, Illustra and Sleepycat, created expressly to commercialize open-source UC Berkeley databases. My current company, Cloudera, builds on the lessons I’ve learned from Mike, using open-source data management technology from the consumer internet: the Apache Hadoop project for big data, and the rich ecosystem it has spawned.

More broadly, the entire database community—industry and academia—owes a great deal to Ingres, and to its implementation in open-source software. The ready availability of a working system meant that others—at universities that couldn’t afford to build or buy their own systems, and at companies that couldn’t afford to fund the blue-sky research that Mike did at Berkeley—could explore a real working system. They could learn from its innovations and build on its strengths.

In 2015, Dr. Michael Stonebraker won the 2014 A.M. Turing Award for lifetime contributions to the relational database community. There’s no question that the very specific, very direct work that Mike led at Berkeley and elsewhere on Ingres, Postgres, columnar storage, and more deserves that award. The relational database market exists in no small part because of his work. That market generates hundreds of billions of dollars in commercial activity every year. His innovative use of a permissive open-source license for every meaningful project he undertook in his career amplified that work enormously. It allowed everyone—the commercial sector, the research community, and Mike himself—to create value on top of his innovation.

The choice of BSD for Ingres may well have been lucky, but in my experience, luck comes soonest to those who put themselves in its path. So much of research, so much of innovation, is just hard work. The inspired laziness of choosing the BSD license for Ingres—getting all the benefits of broad distribution, all the power of collaboration, without the trouble of a fight with the UC Berkeley legal team—put Ingres smack in the way of lucky.

We all learned a great deal from that first, hugely impactful, open-source systems project.

1. A long-standing (since 1966), not-so-healthy diner at Berkeley.

13

The Relational Database Management Systems Genealogy

Felix Naumann

The history of database systems, in particular the relational kind, reaches far back to the beginnings of the computer science discipline. Few other areas of computer science can look back over as many decades and show how concepts, systems, and ideas have survived and flourished to the present day. Inspired by the database lectures of my Ph.D. advisor Christoph Freytag, who always included a short section on the history of DBMSs in his courses, I included some such material in my own slides for undergraduate students (see Figure 13.1 from 2010). My limited view of DBMS history (and slide layout) is apparent.

Later, in the first week of my first database course as a professor, I had presented much introductory material, but lacked hard content to fill the exercise sheets and occupy the students. Thus, I let students choose a system from a long list of well-known DBMSs and asked them to research its history and collect data about its origin, dates, versions, etc. Together with my teaching assistants, I established an initial more-complete timeline, which anticipated the design of the latest version (see Figure 13.2). A graphic designer suggested that we apply a subway-map metaphor and in 2012 created the first poster version of the genealogy, as it appears today (Figure 13.3). Many years and versions later, and many systems and nodes more, the current 2017 genealogy is shown in Figure 13.4. Clearly, it has grown much denser for the present time (at the right of the chart), but also much more informative for the beginnings of RDBMS (at left), based on various discoveries of early RDBMSs.

Figure 13.1  Genealogy slide from database lecture (2010).

Overall, the genealogy contains 98 DBMS nodes, 48 acquisitions, 34 branches, and 6 mergers. Many of the DBMSs are no longer in existence—19 are marked as discontinued.

Creating a genealogy like this is a somewhat unscientific process: the concrete start date for a system is usually undefined or difficult to establish. The same is also true for the points in time of almost all other events shown in the chart. Thus, time is treated vaguely—nodes are placed only approximately within their decades. Even more difficult is the treatment of branches, a feature that makes the chart especially interesting. We were very generous in admitting a branch: it could signify an actual code fork, a licensing agreement, or a concrete transfer of ideas, or it could simply reflect researchers and developers relocating to a new employer and re-establishing the DBMS or its core ideas there. There is no structured source from which this chart is automatically created. Every node and every line are manually placed, carefully and thoughtfully.

Figure 13.2  First poster version of the genealogy (2012).

Another important question was and remains which database systems to include. We have been strict about including only relational systems supporting at least some basic SQL capabilities. Hierarchical and object-oriented systems are excluded, as are XML databases, graph databases, and key-value stores. Another criterion for inclusion is that the system has at least some degree of distribution or some user base. A simple research prototype that was used only for research experiments would not fit that description. That being said, we took some liberty in admitting systems and still welcome any feedback for or against specific systems.

Figure 13.3  Version 1 of subway-map genealogy (2012).

In fact, after the initial design and publication of the genealogy in 2012, the main source for additions, removals, and corrections was email feedback from experts, researchers, and developers after the announcement of each new version. Over the years, more than 100 persons contacted me, some with a great many suggestions and corrections. The shortest email included nothing but a URL to some obscure DBMS; the most input by far came from David Maier. Especially for the early history of RDBMS, I relied on feedback from many other leading experts in database research and development, including (in chronological order of their input) Martin Kersten, Gio Wiederhold, Michael Carey, Tamer Özsu, Jeffrey Ullman, Erhard Rahm, Goetz Graefe, and many others. Without their experience and recollection, the genealogy would not exist in its current form, showing the impressive development of our field. For the most recent version, which is included in this book, Michael Brodie of MIT CSAIL initiated conversations with Joe Hellerstein, Michael Carey, David DeWitt, Kapali Eswaran, Michael Stonebraker, and several others, unearthing various new systems and new connections (sometimes scribbled on classroom notes), with a slight bias towards the numerous ones that can be traced back to Michael Stonebraker’s work. These DBMSs can be found all over the genealogy, making Michael the great-great-grandfather of some of them. Starting from Ingres at the top left of that chart, you not only can reach many commercial and non-commercial systems, but also find smaller projects hidden in the genealogy, such as Mariposa, H-Store, and C-Store.

While I do not have download statistics, I have occasionally seen the printed poster in the background during television interviews with IT experts who had hung it on their office walls. Downloading, printing, and using the chart is free. Please find the latest version at http://hpi.de/naumann/projects/rdbms-genealogy.html. And, as always, additions and corrections are welcome.

Figure 13.4  Subway-map genealogy today.

PART VII

CONTRIBUTIONS BY SYSTEM

Overview, Chapter 14

VII.A Research Contributions by System, Chapters 15–23

VII.B Contributions from Building Systems, Chapters 24–31

Chapters in this part are in pairs. Research chapters in VII.A have corresponding systems chapters in VII.B. Research results described in Chapter 15 led to systems results (Ingres) described in Chapter 24; research results described in Chapter 16 led to systems results (Postgres) described in Chapter 25; and so forth.

14

Research Contributions of Mike Stonebraker: An Overview

Samuel Madden

As preceding chapters make clear, Mike Stonebraker has had a remarkable career, with at least (depending on how one counts) eight incredibly influential database systems, many of which were backed by commercial companies, spanning five decades of work. In the chapters in this section, Mike’s collaborators on these projects and systems look at their technical contributions to computing. For each of his major systems, there are two chapters: one highlighting the intellectual, research, and commercial impact and Mike’s role in crafting these ideas, and the other describing the software artifacts and codelines themselves. Like all large software systems, these projects were not Mike’s alone, but in all cases their success and influence were magnified by Mike’s involvement. Our goal in these writings is not to exhaustively recap the research contributions of the work, but instead capture a bit of what it was like to be there with Mike when the ideas emerged and the work was done.

Technical Rules of Engagement with Mike

Before diving into the technical details, it’s worth reflecting a bit on the general technical rules of engagement when working with Mike.

First, Mike is an incredible collaborator. Even at 74, he comes to every meeting with new ideas to discuss and is almost always the first to volunteer to write up ideas or draft a proposal. Despite being the CTO of at least two companies, he fires off these research drafts seemingly instantaneously, leaving everyone else scrambling to keep up with him and his thinking.

Second, in research, Mike has a singular focus on database systems—he is not interested in other areas of computer science or even computer systems. This focus has served him well, magnifying his impact within the area and defining the scope of the systems he builds and companies he founds.

Third, Mike values simple, functional ideas above all else. Like all good systems builders, he seeks to eliminate complexity in favor of practicality, simplicity, and usability—this is part of the reason for his success, especially in commercial enterprises. He tends to dismiss anything that he perceives as complicated, and often prefers simple heuristics over complex algorithms. Frequently this works out, but it is not without pitfalls. For example, as described in Chapter 18 and Chapter 27, Mike felt strongly that C-Store and its commercial offspring should not use a conventional dynamic-programming-based algorithm for join ordering; instead, he believed in a simpler method that assumed that all tables were arranged in a “star” or “snowflake” schema, and only allowed joins between tables arranged in this way. Such schemas were common in the data warehouse market that Vertica was designed for, and optimizing such queries could be done effectively with simple heuristics. Ultimately, however, Vertica had to implement a real query optimizer, because customers demanded it, that is, the complicated design was actually the right one!
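
The flavor of the heuristic can be sketched in a few lines. This is an invented toy model, not Vertica's actual optimizer: given a star schema, join the fact table against each dimension in order of ascending selectivity, so the most selective predicate shrinks the intermediate result first, with no dynamic programming over join orders. The table names, sizes, and selectivities are all hypothetical.

```python
fact = {"name": "sales", "rows": 10_000_000}

# Each dimension join keeps roughly `selectivity` of the surviving fact rows.
dimensions = [
    {"name": "store",   "selectivity": 0.20},
    {"name": "product", "selectivity": 0.01},
    {"name": "date",    "selectivity": 0.50},
]

def star_join_order(fact, dims):
    """Heuristic: apply the most selective dimension join first."""
    order = sorted(dims, key=lambda d: d["selectivity"])
    plan, rows = [], fact["rows"]
    for d in order:
        rows = int(rows * d["selectivity"])
        plan.append((d["name"], rows))  # (join partner, estimated rows out)
    return plan

plan = star_join_order(fact, dimensions)
print(plan)  # → [('product', 100000), ('store', 20000), ('date', 10000)]
```

For queries that really do follow the star shape, a greedy ordering like this is close to what exhaustive dynamic programming would pick, at a tiny fraction of the planning cost; it is queries outside that shape that eventually forced a real optimizer.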

Fourth, Mike tends to see ideas in black and white: either an idea is great or it is awful, and more than one researcher has been taken aback by his willingness to dismiss their suggestions as “terrible” (see Chapter 21). But Mike is malleable: even after dismissing an idea, he can be convinced of alternative viewpoints. For example, he recently started working in the area of automated statistical and AI approaches to data integration and founded the company, Tamr (see Chapter 30), despite years of arguing that this problem was impractical to solve with AI techniques due to their complexity and inability to scale, as some of his collaborators describe later in this introduction.

Fifth, although Mike does form opinions quickly and doesn’t shy away from sharing them—sometimes in controversial fashion—he’s usually right. A classic example is his quip (in a 2008 blog post with David DeWitt [DeWitt and Stonebraker 2008]) that Hadoop was “a major step backwards,” which resulted in some on the Internet declaring that Mike had “jumped the shark.” However, the point of that post—that most data processing was better done with SQL-like languages—was prescient. Hadoop was rather quickly displaced. Today, most users of post-Hadoop systems, like Spark, actually access their data through SQL or an SQL-like language, rather than programming MapReduce jobs directly.

Finally, Mike is not interested in (to use a famous phrase of his) “zero-billion-dollar” ideas. Mike’s research is driven by what real-world users (typically commercial users) want or need, and many of his research projects are inspired by what business users have told him are their biggest pain points.1 This is a great strategy for finding problems, because it guarantees that research matters to someone and will have an impact.

Mike’s Technical Contributions

In the rest of this introduction, we discuss technical contributions and anecdotes from collaborators on Mike’s “big systems.” The chapters in this section go in chronological order, starting with his time at UC Berkeley, with the Ingres and Postgres/Illustra projects and companies, and then going on to his time at MIT and the Aurora and Borealis/StreamBase, C-Store/Vertica, H-Store/VoltDB, SciDB/Paradigm4, and Data Tamer/Tamr projects/companies. Amazingly, this introduction actually leaves out two of his companies (Mariposa and Goby), because they are less significant both commercially and academically than many of his research projects, some of which you’ll find mentioned elsewhere in the book. Finally, some of his current collaborators talk about what Mike has been doing in his most recent research.

The Berkeley Years

For the Ingres project, which Mike began when he got to Berkeley in 1971, Mike Carey writes about the stunning audacity of the project and its lasting impact (Chapter 15). Mike Carey was one of Mike’s early Ph.D. students, earning his Ph.D. in 1983 and becoming one of the most influential database systems builders in his own right. He shares his perspective on how a junior professor with a background in math (thesis title: “The Reduction of Large Scale Markov Models for Random Chains”) decided that building an implementation of Ted Codd’s relational algebra was a good idea, and describes how Mike Stonebraker held his own competing against a team of 10-plus Ph.D.s at IBM building System R (Chapter 35). As Mike Carey describes, the Ingres project had many significant research contributions: a declarative language (QUEL) and query execution, query rewrites and view substitution algorithms, hashing, indexing, transactions, recovery, and many other ideas we now take for granted as components of relational databases. Like much of Mike’s work, Ingres was equally influential as an open-source project (Chapter 12) that became the basis of several important commercial database systems (illustrated in living color in Chapter 13).

In his write-up about Mike’s next big Berkeley project, Postgres (“Post-Ingres”), Joe Hellerstein (Chapter 16) (who did his Master’s work with Mike during the Postgres era and is another leading light in the database system area) describes Postgres as “Stonebraker’s most ambitious research project.” And indeed, the system is packed full of important and influential ideas. Most important and lasting is its support for abstract data types (aka user-defined types) in databases through the so-called “Object-Relational” model, as an alternative to the then-popular “Object-Oriented Database” model (which has since fallen out of favor). ADTs are now the standard way all database systems implement extensible types and are critically important to modern systems. Other important ideas included no-overwrite storage/time travel (the idea that a database could store its entire history of deltas and provide a view as of any historical point in time) and rules/triggers (actions that could be performed whenever the database changed). As Joe describes, Postgres also led to the development of Mike’s first big commercial success, Illustra (Chapter 25), as well as (eventually) the release of the hugely influential PostgreSQL open-source database, which is one of the “big two” open-source RDBMS platforms (with MySQL) still in use today (Chapter 12).
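The no-overwrite/time-travel idea can be made concrete with a minimal sketch. This is an illustration of the concept only, not Postgres's actual storage format: every update appends a new timestamped version rather than overwriting, and a query "as of" time T picks the latest version no newer than T:

```python
# Minimal sketch of no-overwrite storage with time travel.
# Not Postgres's actual format: versions live in an in-memory dict of lists.

class NoOverwriteTable:
    def __init__(self):
        self.versions = {}  # key -> list of (timestamp, value); never overwritten

    def put(self, key, value, ts):
        """An 'update' appends a new version; old versions survive."""
        self.versions.setdefault(key, []).append((ts, value))

    def get_as_of(self, key, ts):
        """Return the latest value for key with timestamp <= ts."""
        candidates = [(t, v) for t, v in self.versions.get(key, []) if t <= ts]
        return max(candidates)[1] if candidates else None

t = NoOverwriteTable()
t.put("salary:alice", 90_000, ts=1)
t.put("salary:alice", 95_000, ts=5)       # update appends; history is kept
print(t.get_as_of("salary:alice", ts=3))  # 90000: the historical view
print(t.get_as_of("salary:alice", ts=9))  # 95000: the current view
```

Because nothing is overwritten, any historical state of the table remains queryable, which is exactly the "view as of any historical point in time" described above.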

The Move to MIT

Mike left Berkeley in the late 1990s and a few years later moved to MIT as an adjunct professor. Even though he had already achieved more academically and commercially than most academics do in a career, he enthusiastically embarked on a remarkable series of research projects, beginning with the stream processing projects Aurora and Borealis (Chapter 17). These projects marked the beginning of a long series of collaborations among MIT, Brown, and Brandeis, which I would join when I came to MIT in 2004. In their chapter about Aurora/Borealis, Magda Balazinska (now a professor at the University of Washington and then a student on the projects) and Stan Zdonik (a professor at Brown and one of Mike’s closest collaborators) reflect on the breadth of ideas from the Borealis project, which considered how to re-engineer a data processing system that needs to process “streams”: continuously arriving sequences of data that can be looked at only once. Like many of Mike’s projects, Aurora eventually became StreamBase (Chapter 26), a successful startup (sold to TIBCO a few years ago) focused on this kind of real-time data, with applications in finance, Internet of Things, and other areas.
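The one-look-only constraint on streams can be illustrated with a minimal sketch (illustrative only; Aurora/Borealis provided a full operator algebra, not this toy): a streaming operator maintains an aggregate over a sliding window without ever storing the full stream:

```python
# Minimal sketch of one-pass stream processing: a sliding-window average.
# Each arriving value is examined exactly once; only the window is retained.
from collections import deque

class WindowAverage:
    def __init__(self, window_size):
        self.window = deque(maxlen=window_size)  # older tuples fall away

    def push(self, value):
        """Consume one arriving tuple and emit the current window average."""
        self.window.append(value)
        return sum(self.window) / len(self.window)

op = WindowAverage(window_size=3)
for reading in [10, 20, 30, 40]:
    latest = op.push(reading)
print(latest)  # average of the last three readings: (20 + 30 + 40) / 3 = 30.0
```

The key design point is that the operator's state is bounded by the window size, not by the (unbounded) length of the stream.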

The “One Size Doesn’t Fit All” Era

In the 2000s, the pace of Mike’s research accelerated, and he founded companies at a breakneck pace, with five companies (Vertica, Goby, Paradigm4, VoltDB, Tamr) founded between 2005 and 2013. Two of these—Vertica and Goby—ended with acquisitions, and the other three are still active today. As inspiration for Vertica, Paradigm4, and VoltDB, Mike drew on his famous quip that “one size does not fit all,” meaning that although it is technically possible to run massive-scale analytics, scientific data, and transaction processing workloads, respectively, in a conventional relational database, such a database will be especially good at none of these workloads. In contrast, by building specialized systems, order-of-magnitude speedups are possible. With these three companies and their accompanying research projects, Mike set out to prove this intuition correct.

In the case of C-Store, our idea was to show that analytics workloads (comprising read-intensive workloads that process many records at a time) are better served by a so-called “column-oriented” approach, where data from the same column is stored together (e.g., with each column in a separate file on disk). Such a design is suboptimal for workloads with lots of small reads or writes but has many advantages for read-intensive workloads including better I/O efficiency and compressibility. Daniel Abadi (now a professor at the University of Maryland) writes about his early experiences as a grad student on the project (Chapter 18) and how Mike helped shape his way of thinking about system design through the course of the project. C-Store was classic “systems research”: no new deep theoretical ideas, but a lot of design that went into making the right decisions about which components to use and how to combine them to achieve a working prototype and system. Mike commercialized C-Store as the Vertica Analytic Database, which was quite successful and continues to be one of the more widely used commercial analytic database systems. Vertica was acquired by HP in 2011 and is now owned by Micro Focus, Int’l PLC.
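The layout difference can be sketched in a few lines of Python (a toy in-memory illustration; real column stores add per-column disk files, compression, and vectorized execution):

```python
# Toy comparison of row- vs. column-oriented layouts for an analytic scan.

rows = [(i, f"name{i}", i % 100) for i in range(1000)]  # (id, name, age)

# Row store: whole tuples together; scanning one column touches every field.
def row_scan_age(table):
    return sum(age for _id, _name, age in table)

# Column store: each column kept separately; the scan touches only "age".
columns = {
    "id":   [r[0] for r in rows],
    "name": [r[1] for r in rows],
    "age":  [r[2] for r in rows],
}
def col_scan_age(cols):
    return sum(cols["age"])

assert row_scan_age(rows) == col_scan_age(columns)  # same answer, less data read
```

In the column layout, an analytic query that reads one of k columns touches roughly 1/k of the stored data, and storing like-typed values together also compresses far better, which is the core of the I/O advantage described above.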

In the H-Store project, the idea was to see how “one size does not fit all” could be applied to transaction processing systems. The key observation was that general-purpose database systems—which assume that data doesn’t fit into memory and use standard data structures and recovery protocols designed to deal with this case—give up a great deal of efficiency compared to a system where data is assumed to be memory-resident (generally the case of transaction processing systems on modern large main-memory machines). We designed a new transaction processing system; Andy Pavlo (who writes about H-Store in Chapter 19) and several other graduate students built the prototype system, which pushed the boundaries of transaction processing several orders of magnitude beyond what general-purpose databases could achieve. The H-Store design became the blueprint for Mike’s founding of VoltDB (Chapter 28), which is still in business today, focused on a variety of low-latency and real-time transaction processing use cases. More recently, Mike has extended the H-Store design with support for predictive load balancing (“P-Store” [Taft et al. 2018]) and reactive, or elastic, load balancing (“E-Store” [Taft et al. 2014a]).

After H-Store, a group of academics (including me) led by Mike and David DeWitt went after another community not well served by current “one size fits all” databases: scientists (Chapters 20 and 29). In particular, many biologists and physicists have array-structured data that, although it can be stored in databases as relations, is not naturally structured as such. The academic project, called SciDB, looked at a number of problems, including data models and query languages for array data, how to build storage systems that are good for sparse and dense arrays, and how to construct systems with built-in versioning of data appropriate for scientific applications. In prototypical fashion, Mike quickly started a company, Paradigm4, in this area as well. In an entertaining essay, Paul Brown, the chief architect of the product, describes the journey from conception to today in terms of a mountain-climbing expedition, replete with all of the thrills and exhaustion that both expeditions and entrepreneurship entail (Chapter 20). The development of the SciDB codeline is described in Chapter 29.

The 2010s and Beyond

After his sequence of one-size-does-not-fit-all projects in the 2000s, around the turn of the decade, Mike turned to a new area of research: data integration, or the problem of combining multiple related data sets from different organizations together into a single unified data set. First, in the Data Tamer project, Mike and a group of researchers set out to work on the problem of record deduplication and schema integration—that is, how to take a collection of datasets describing the same types of information (e.g., employees in two divisions of a company) and create a single, unified, duplicate-free dataset with a consistent schema. This is a classic database problem, traditionally solved through significant manual effort. In the Data Tamer project, the idea was to automate this process as much as possible, in a practical and functional way. This led to the creation of Tamr Inc. in 2013. In his write-up (Chapter 21), Ihab Ilyas talks about his experience on the academic project and as a co-founder of the company and relates some of the valuable lessons he learned while building a real system with Mike. The development of the Tamr codeline is described in Chapter 30.
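The matching step at the heart of deduplication can be sketched as follows. This is only an illustration using a stdlib string-similarity measure with an invented threshold; Tamr's actual approach relies on trained models and human expert feedback rather than a fixed rule:

```python
# Toy sketch of record deduplication via string similarity.
# Illustrative only: the 0.85 threshold and greedy clustering are invented.
from difflib import SequenceMatcher

def similar(a, b, threshold=0.85):
    """Treat two strings as duplicates if they are mostly the same text."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

def dedup(records):
    """Greedily keep one representative per cluster of near-duplicates."""
    kept = []
    for rec in records:
        if not any(similar(rec, k) for k in kept):
            kept.append(rec)
    return kept

employees = ["Michael Stonebraker", "michael stonebraker", "M. Tamer Ozsu"]
print(dedup(employees))  # ['Michael Stonebraker', 'M. Tamer Ozsu']
```

Even this toy shows why the problem is hard to fully automate: the threshold that merges obvious near-duplicates may also merge distinct but similarly named records, which is where human-guided machine learning earns its keep.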

This section concludes with a discussion of one of Mike’s most recent projects, which centers on the idea of polystores: “middleware” that allows users to query across multiple existing databases without requiring the data to be ingested into a single system. We built such a polystore, called BigDAWG, as a part of the Intel Science and Technology Center for Big Data (Chapter 22). The chapter is written by Timothy Mattson, Intel Technical Fellow and a co-developer of one of the first polystore systems, with co-authors Jennie Rogers (now a professor at Northwestern) and Aaron Elmore (now a professor at the University of Chicago), both of whom were postdocs working on the BigDAWG project. BigDAWG is available as open source code (Chapter 31).

In a related project, Mourad Ouzzani, Nan Tang, and Raul Castro Fernandez talk about working with Mike on data integration problems that go beyond schema integration toward a complete system for finding, discovering, and merging related datasets, possibly collected from diverse organizations or sub-groups (Chapter 23). For example, a data scientist at a pharmaceutical company may wish to relate his or her local drug database to a publicly available database of compounds, which requires finding similar columns and data, eliminating noisy or missing data, and merging datasets together. This project, called Data Civilizer, is a key part of a research collaboration with the Qatar Center for Research and Innovation (QCRI), and is noteworthy for the fact that it has moved Mike away from some of the typical “systems” problems he has worked on toward much more algorithmic work focused on solving a number of nitty-gritty problems around approximate duplicate detection, data cleaning, approximate search, and more. The development of Aurum, a component of the yet-to-be-developed Data Civilizer, is described in Chapter 33.

In summary, if you make purchases with a credit card, check your bank balance online, use online navigation data from your car, or use data to make decisions on the job, you’re likely touching technologies that originated with Mike Stonebraker and were perfected by him with his many students and collaborators over the last 40-plus years. We hope that you enjoy the following “under the hood” looks at the many innovations that made databases work for us all.

1. Mike describes the use of such pain points that became the source of innovation and value in Ingres and other projects; see Chapter 7.

PART VII.A

Research Contributions by System

15

The Later Ingres Years

Michael J. Carey

This chapter is an attempt, albeit by a relative latecomer to the Ingres party, to chronicle the era when it all started for Mike Stonebraker and databases: namely, the Ingres years! Although my own time at UC Berkeley was relatively short (1980–1983), I will attempt to chronicle many of the main activities and results from the period 1971–1984 or thereabouts.

How I Ended Up at the Ingres Party

The path that led me to Berkeley, and specifically to Mike Stonebraker’s database classes and research doorstep, was entirely fortuitous. As an electrical engineering (EE)/math undergraduate at Carnegie Mellon University (CMU) trying to approximate an undergraduate computer engineering or computer science major before they existed at CMU, I took a number of CS classes from Professor Anita Jones. At the time Anita was building one of the early distributed operating systems (StarOS) for a 50-node NUMA multiprocessor (Cm*). She also co-supervised my master’s degree thesis (for which I wrote a power system simulator that ran on Cm*). I decided to pursue a Ph.D. in CS, and I sought Anita’s sage advice when it came time to select a program and potential advisors to target. I was inclined to stay closer to home (East Coast), but Anita “made me” go to the best school that admitted me, which was Berkeley. When I asked her for advice on potential systems-oriented research advisors—because I wanted to work on “systems stuff,” preferably parallel or distributed systems—Anita suggested checking out this guy named Mike Stonebraker and his work on database systems (Ingres and Distributed Ingres). CMU’s CS courses didn’t cover databases and, with parents who worked in business and accounting, I thought that working on databases sounded boring: possibly the CS equivalent of accounting. But I mentally filed away her advice nonetheless and headed off to Berkeley.

Once at Berkeley, it was time to explore the course landscape in CS to prepare for the first layer of Ph.D. exams as well as to explore areas to which I had not yet been exposed. Somewhat grudgingly following Anita’s advice, I signed up for an upper-division database systems class taught by Mike. By that time (academic year 1980–81), Mike and his faculty colleague Professor Eugene Wong had built and delivered to the world the Ingres relational DBMS—more on that shortly—and it was the basis for the hands-on assignments in my first database class.

That first class completely changed my view of things. It turned out to be really interesting, and Mike (despite what he will say about himself as a teacher) did a great job of teaching it and making it interesting, thereby luring me down a path toward databases. I went on to take Mike’s graduate database class next. By then I was hooked: very cool stuff. I learned that this newly emerging database systems field was really a vertical slice of all of CS—including aspects of languages, theory, operating systems, distributed systems, AI (rule systems), and so on—but with an emphasis on “doing it for data” and in a declarative way.

Databases weren’t so boring after all. In fact, they were interesting and now on my list of possible Ph.D. subareas. By the time I passed the preliminary exams and it was time to seek an advisor, I had three or four possibilities on my list: computer architecture (Dave Patterson), operating systems/distributed systems (John Ousterhout or perhaps Mike Powell), and of course databases (Mike). I talked to each one about the road to a Ph.D. in their view, and for the most part what I heard was a five-year hike up the Ph.D. mountain. But not from Mike! He had various potential topics in mind, convinced me that transaction management was like operating systems (it is), and told me he could have me out in three years. Sold to the lowest bidder! It also didn’t hurt that, at the time, Mike had created a fun and inviting database group culture, insisting that interested students join the Ingres team at La Val’s Pizza up the street from Cory Hall (a.k.a. Ingres HQ) for periodic outings for pizza and pitchers of beer (where, I seem to recall, Mike always had adverse reactions to empty glasses).

Ingres: Realizing (and Sharing!) a Relational DBMS

Mike’s database story begins with Ingres [Held and Stonebraker 1975], short for INteractive Graphics REtrieval System. It was one of the world’s two high-impact early relational DBMSs; the other was System R from IBM, a concurrent project at the IBM San Jose Research Center. The Ingres system was co-developed by Mike and Gene (ten years his senior, and the person who talked Mike into reading Ted Codd’s seminal paper and into converting to be a “database guy”). I missed the early years of Ingres, as I arrived in 1980 and started hanging out in Ingres territory only in 1981, by which time the initial Ingres project was over and Distributed Ingres was also mostly winding down (Chapter 5). A number of the early Ingres system heroes had already come and gone—e.g., Jerry Held, Karel Youssefi, Dan Ries, Bob Epstein—and the single-system version of Ingres had been distributed to over 1,000 sites (prehistoric open source!). The Distributed Ingres project was also largely “done”: there was a prototype that ran distributed queries over geographically distributed data, but it was never hardened or shared like Ingres. I overlapped for a year with team member Dale Skeen, who worked on distributed commit protocols and who mentored me (thanks, Dale!) and then left for academia before starting several notable companies, including TIBCO, the birthplace of pub/sub (publish/subscribe messaging). I did get to meet Eric Allman; he was the backbone (staff member) for the Ingres group.

In my day as a graduate student and in my “first career” as an academic at the University of Wisconsin-Madison, Ingres was tremendously well known and highly regarded (Chapter 6). I fear that the same is not true today, with the commercial success of IBM’s DB2 system—the main competitor for Ingres—which has made System R much more visible in today’s rear-view mirror for students. In reality, Ingres—Mike and Gene’s first gift to the database world—was truly a remarkable achievement that helped shape the field today in ways that are well worth reiterating here.

Ingres made a number of contributions, both technically and socially. In my view, it is still Mike’s finest database achievement, which is obviously saying a lot given the many things that Mike has done over the course of his career.

So, what was the big deal about Ingres? Let’s have a look at what Ingres showed us …

1.  One of the foremost contributions of Ingres—likewise for System R—was showing that Ted Codd wasn’t out of his mind when he introduced the relational model in 1970, i.e., that it was indeed possible to build a relational DBMS. Ingres was a complete system: It had a query language, a query execution system, persistent data storage, indexing, concurrency control, and recovery—and even a C language embedding (EQUEL [Allman et al. 1976])—all assembled by a team of faculty, students, and a few staff at a university [Stonebraker 1976b]. This in and of itself is a truly remarkable achievement.

Figure 15.1  A simple Quel join query, by Mike Stonebraker circa 1975. Source: [Stonebraker 1975].

2.  Ingres showed how one could design a clean, declarative query language—Quel—based on Codd’s declarative mathematical language ideas. Quel was the language used in the database class that converted me. It was a very clean language and easy to learn—so easy in fact that I taught my mother a little bit of Quel during one of my visits home from Berkeley and she “got it.” Quel’s design was based on the tuple relational calculus—adding aggregates and grouping—and (in my opinion) it should have won the query language wars in the 1980s. SQL is less concise and an odd mix of the relational calculus (FROM) and algebra (JOIN, UNION). In my view, it won largely because Larry Ellison read the System R papers (one of which Oracle actually cites in its new-employee training!) and joined IBM in deciding to commercialize and popularize SQL.1 As an example for the historical record, the following (see Figure 15.1) is a simple Quel join query from the original user manual that Mike himself wrote (as an Electronics Research Lab (ERL) memo [Stonebraker 1975]).
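A Quel join in the general style of Figure 15.1 looks roughly like the following (the relation and attribute names here are invented for illustration, not taken from the original manual): range variables are declared first, then a single retrieve states the result and the join predicate.

```
range of e is employee
range of d is dept
retrieve (e.name, d.floor)
    where e.dept = d.dname
```

The rough SQL equivalent is `SELECT e.name, d.floor FROM employee e, dept d WHERE e.dept = d.dname`, which illustrates the conciseness point above: Quel has no SELECT/FROM split, just range declarations and one retrieve clause.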

3.  Ingres showed that one could create an efficient implementation of a declarative query language. Ingres included a query optimizer that (a) reordered joins based on their input sizes and connecting predicates and (b) accessed relations by picking from among the access paths supported by the available indexes. Interestingly, the Ingres optimizer worked together with the query executor incrementally—alternately picking off the next “best” part of the query, running it, seeing the result size, and then proceeding to the next step [Wong and Youssefi 1976]. System R took a more static, pre-compilation-based approach. System R is (rightly) famous for teaching the world how to use statistics and cost functions and dynamic programming to compile a query. However, today there is renewed interest in more runtime-oriented approaches—not unlike the early Ingres approach—that do not rely as heavily on a priori statistics and costing.
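The incremental flavor of that optimizer can be sketched as follows (a drastic simplification of the Wong-Youssefi decomposition; the relation names and size factors are invented): repeatedly pick the step that keeps the intermediate result smallest, "run" it, observe the resulting size, and use that observed size when choosing the next step:

```python
# Simplified sketch of run-then-decide query optimization in the spirit of
# early Ingres; not the actual Wong-Youssefi algorithm.

def greedy_incremental_join(base, joins):
    """base: (name, rows); joins: relation name -> result-size factor.
    After each step, the observed intermediate size feeds the next choice."""
    order, size = [base[0]], base[1]
    remaining = dict(joins)
    while remaining:
        # Pick the join producing the smallest next intermediate result.
        name = min(remaining, key=lambda n: size * remaining[n])
        size = size * remaining[name]  # "run" the step and observe its size
        order.append(name)
        del remaining[name]
    return order, size

order, final = greedy_incremental_join(("emp", 1000),
                                       {"dept": 0.1, "proj": 2.0})
print(order)  # ['emp', 'dept', 'proj']: the shrinking join runs first
```

Unlike a static, pre-compiled plan, each decision here can react to the sizes actually observed at runtime, which is why the paragraph above notes renewed interest in this style of optimization.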

4.  Ingres showed that one could build a full storage manager, including heap files and a variety of indexes (ISAM (Indexed Sequential Access Method) and hashing) that an optimizer could see and use, and with support for concurrent transactions and crash recovery. Being a smaller, university-based project, Ingres took a simpler approach to transactions (e.g., relation-level locks and sorting by relation name before locking to avoid deadlocks). But the Ingres system was nevertheless complete, delivered, and used! As a historical sidebar, one of my favorite forgotten Mike papers was a Communications of the Association for Computing Machinery (CACM) article (“B-Trees Re-examined” [Held and Stonebraker 1978]) in which Mike and Jerry Held explained why B+ trees might never work, due to then-unsolved challenges of dealing with concurrency control and recovery for dynamic index structures, and why static indexes like those used in Ingres were preferable.
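
The lock-ordering trick mentioned in passing above is worth a concrete sketch. The class below is hypothetical (the Ingres lock manager looked nothing like this); it only shows why acquiring relation-level locks in a single global order, here alphabetical by relation name, rules out deadlock cycles.

```python
# Hypothetical sketch of deadlock avoidance via a global lock order:
# every transaction acquires its relation-level locks sorted by name,
# so no cycle of transactions waiting on each other can form.

import threading

class LockTable:
    """One exclusive lock per relation, acquired only in name order."""
    def __init__(self, relation_names):
        self.locks = {name: threading.Lock() for name in relation_names}

    def acquire_all(self, names):
        for name in sorted(names):   # the whole trick is this sort
            self.locks[name].acquire()

    def release_all(self, names):
        for name in sorted(names, reverse=True):
            self.locks[name].release()
```

If every transaction that touches both supplier and supply locks supplier first, two transactions can never each hold one of the locks while waiting for the other's.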

5.  Ingres was built for AT&T Unix and led to a full-featured system that was shared and used (again, think “prehistoric open source”) by many folks at other universities and labs to actually store and query their data. As an example, when I landed in Wisconsin after leaving Berkeley, I discovered that University of Wisconsin-Madison economics professor Martin David and several of his colleagues were using Ingres to store and query U.S. government Survey of Income and Program Participation (SIPP) data [Flory et al. 1988]. As another example, as a graduate student at CMU, Rick Snodgrass used Ingres to store and query software events coming from parallel Cm* program executions for debugging [Snodgrass 1982], which is what inspired Rick’s later temporal database career (see Chapter 12 for more on Mike and his open-source impact(s)).

6.  Ingres was used for teaching. Many students in the 1980s were able to get their hands on relational database technology solely because the Ingres system was so widely distributed and used in introductory database system classes. It helped to create a whole new generation of relational-database-savvy computer scientists!

7.  Ingres was a sufficiently complete and solid software system that it became the basis for a very successful commercial RDBMS of its day, namely RTI (Relational Technology, Inc.) Ingres. See Chapter 24 by Paul Butterworth and Fred Carter for a very interesting discussion of the birth and subsequent development of Commercial Ingres, including how it began, what was changed from the university version, and how it evolved over time (and remained informed by university research).

Distributed Ingres: One Was Good, So More Must Be Better

As I alluded to earlier, the next step for Mike, after Ingres, was to launch the Distributed Ingres project [Stonebraker and Neuhold 1977]. The distributed DBMS vision circa 1980—a vision shared by essentially the entire DB research community at that time—was of a single-image relational DBMS that could manage geographically distributed data (e.g., a database with sites in Berkeley, San Jose, and San Francisco) while making it look as though all the data was stored in a single, local relational DBMS. While the Distributed Ingres effort didn’t lead to another widely shared “open source” system, it did produce a very interesting prototype as well as many technical results related to distributed query processing and transaction management. Major competing projects at the time were R* at IBM (i.e., distributed System R), SDD-1 at CCA (in Cambridge, MA), and a small handful of European efforts (e.g., SIRIUS). Among the contributions of the Distributed Ingres effort were the following.

1.  Results by Bob Epstein, working with Mike (and also Gene), on distributed data storage and querying [Epstein et al. 1978]. Relations could be distributed in whole or as fragments with associated predicates. (See Figure 15.2, borrowed directly from Stonebraker and Neuhold [1977] in all of its late 1970s graphical glory. The supply relation might be in San Jose, which might also store a fragment of supplier where supplier.city = “San Jose,” with other supplier fragments being stored in Berkeley and elsewhere.) The project produced some of the first results, implemented in the prototype, on query processing in a geographically distributed setting. Distributed Ingres generalized the iterative runtime-oriented approach of Ingres and considered how and when to move data in order to run distributed queries involving multiple relations and fragments thereof.
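
A minimal sketch of the fragment idea, with invented site names and predicates echoing the supplier example: each fragment is defined by a predicate, tuples are routed to the site whose predicate they satisfy, and the union of the fragments reconstructs the single-image relation.

```python
# Sketch of predicate-based horizontal fragmentation (site names and
# predicates invented, echoing the supplier example above).

FRAGMENTS = {
    "san_jose":  lambda t: t["city"] == "San Jose",
    "berkeley":  lambda t: t["city"] == "Berkeley",
    "elsewhere": lambda t: t["city"] not in ("San Jose", "Berkeley"),
}

def route(tuples):
    """Assign each tuple to the first (disjoint) predicate it satisfies."""
    sites = {site: [] for site in FRAGMENTS}
    for t in tuples:
        for site, pred in FRAGMENTS.items():
            if pred(t):
                sites[site].append(t)
                break
    return sites

def reconstruct(sites):
    """The single-image illusion: the union of fragments is the relation."""
    return [t for frag in sites.values() for t in frag]

suppliers = [{"sname": "s1", "city": "San Jose"},
             {"sname": "s2", "city": "Berkeley"},
             {"sname": "s3", "city": "Boston"}]
sites = route(suppliers)
```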

2.  Results by Dan Ries on concurrency control—particularly locking—in such an environment, based on simulations informed by the Distributed Ingres prototype [Ries and Stonebraker 1977a]. This was among the earliest work on distributed locking, considering centralized vs. distributed approaches to lock management in a distributed DBMS; studying the impact of lock granularity on such a system; and investigating alternative strategies for handling deadlocks (including methods based on prevention, preemption, and detection).

3.  The birth of “shared nothing!” As the Distributed Ingres prototype effort was winding down, Mike threatened to create a new and different Ingres-based distributed DBMS prototype, which he named “MUFFIN” [erl-m79-28]. In this visionary 1979 ERL memo (which was never published as a paper nor realized as a system after all), Mike proposed to harness the combined power of a set of Ingres’ (“D-CELLs”) in parallel by organizing them in a shared-nothing fashion, much as was done later in the Teradata, Gamma, and GRACE database machines [DeWitt and Gray 1992].

Figure 15.2  A simple example of an early (1977) distributed database. Source: [Stonebraker and Neuhold 1977].

4.  Theoretical results by Dale Skeen on commit and recovery protocols for distributed DBMSs, again inspired by the Distributed Ingres context [Skeen and Stonebraker 1983]. This work provided a formal model for analyzing such distributed protocols in the face of failures and applied the model to analyze existing protocols and to synthesize new ones as well. Both centralized and decentralized (e.g., quorum-based) schemes were considered.
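
As a flavor of the protocols this model covers, here is a toy centralized two-phase commit, the simplest member of the family analyzed. The `Site` class and its states are invented for illustration; real protocols must also log each state transition so sites can recover after a crash, which is precisely the failure behavior the formal model reasons about.

```python
# Toy centralized two-phase commit: the coordinator collects votes,
# then broadcasts commit only if every participant voted yes.
# The Site class and its states are invented for illustration.

class Site:
    def __init__(self, vote_yes=True):
        self.vote_yes, self.state = vote_yes, "active"
    def prepare(self):
        self.state = "prepared" if self.vote_yes else "aborted"
        return self.vote_yes
    def commit(self):
        self.state = "committed"
    def abort(self):
        self.state = "aborted"

def two_phase_commit(participants):
    votes = [p.prepare() for p in participants]   # phase 1: collect votes
    decision = all(votes)
    for p in participants:                        # phase 2: broadcast
        p.commit() if decision else p.abort()
    return decision

sites = [Site(), Site(vote_yes=False)]
decision = two_phase_commit(sites)   # one "no" vote forces a global abort
```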

5.  Experimental, mostly simulation-based, results by yours truly, on concurrency control alternatives and their performance [Carey and Stonebraker 1984]. I undertook this work as a follow-on to what Dan Ries had started, since by the early 1980s a gazillion different approaches to concurrency control were starting to appear based on a diverse set of mechanisms (locks, timestamps, optimism followed by certification, and, orthogonally, versioning). This brilliant work provided some initial insight into the algorithmic design space’s dimensions and their performance implications.

In spite of the database community’s hopes and dreams, the homogeneous (single-site image) approach to distributed databases never took hold. It was just too early at that time. Nevertheless, the research results from those days have lived on in new contexts, such as parallel database systems and heterogeneous distributed databases. The Distributed Ingres project definitely bore a significant amount of (practical) fruit in the orchard of distributed database management.

Ingres: Moving Beyond Business Data

In the early Ingres era that I walked into, Mike was thinking about how to take Ingres to new levels of data support. This was something of a “between systems” study time in Ingres-land. Mike was looking around at a variety of domains: geographical information systems (one of the early motivating areas for Ingres), office automation systems, VLSI CAD (Very Large-Scale Integration Computer-Aided Design) systems (Berkeley was a hotbed in that field at the time), and so on. At the time, a number of folks in the DB research community were in a “Business data solved! What’s next?” mindset (e.g., see the collection of papers in Katz [1982]). Noting that all these domains had serious data management needs, Mike and Ingres came to the rescue! To address these needs, Mike began to orchestrate his graduate students, many of them master’s students, to tackle different facets of the looming “next-generation data management” problem. In usual Mike fashion, these projects were not simply paper designs, but designs accompanied by working prototypes based on experimentally changing the Ingres code base in different ways. Many interesting ideas and papers resulted, including the following.

1.  A notion of “hypothetical relations” for Ingres [Stonebraker and Keller 1980]. The idea was to provide the ability to create a hypothetical branch of a relation: a differential version of a relation that could be updated and explored without impacting the source relation. The aim was to provide out-of-the-box database support for “what if” database use cases.
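
The mechanism can be sketched as a differential overlay (a hypothetical Python rendition; the Ingres work realized this inside the storage layer): the branch records its own inserts and deletes, and the base relation is never modified.

```python
# Hypothetical Python rendition of a "what if" branch: inserts and
# deletes are recorded as a differential on top of an untouched base.

class HypotheticalRelation:
    def __init__(self, base):
        self.base = base              # shared, never modified here
        self.added, self.removed = [], []

    def insert(self, row):
        self.added.append(row)

    def delete(self, row):
        self.removed.append(row)

    def rows(self):
        """Merged view: base minus deletions, plus insertions."""
        return [r for r in self.base if r not in self.removed] + self.added

emp = [{"name": "sam", "salary": 100}]
what_if = HypotheticalRelation(emp)
what_if.insert({"name": "kim", "salary": 120})
```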

2.  The addition of features to Ingres to enable it to be a viable platform for document data management [Stonebraker 1983a]. Features added in this line of work included support for variable-length strings (which Ingres didn’t initially have), a notion of ordered relations, various new substring operators for Quel, a new break operator for decomposing string fields, and a generalized concatenate operator (an aggregate function) for doing the reverse.

3.  The addition of features to Ingres in support of CAD data management [Guttman and Stonebraker 1982]. The best-known result of that effort, by far, was due to my now famous (but soft-spoken) officemate Toni Guttman: the invention of the R-Tree index structure [Stonebraker and Guttman 1984]. Toni was studying the problem of storing VLSI CAD data from the emerging Caltech Mead-Conway VLSI design era. To simplify the problem of designing very large-scale integrated circuit chips, a new approach had emerged—thus filling the world with millions of rectangles to be managed! In order to index the 2-D geometry of a VLSI design, Toni generalized the ideas from Bayer’s B+ Tree structure, and R-Trees were thus born. Note that “R” stood for rectangle, and the first use case was for actually indexing rectangles. Of course, R-Trees are also widely used today to index other spatial objects by indexing their bounding boxes.
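
The primitive underlying every level of an R-Tree is the rectangle-overlap test. The flat index below is only a sketch, with invented VLSI object names; a real R-Tree arranges the rectangles into a balanced tree of nested bounding boxes so that a window query visits only the subtrees whose boxes overlap the window.

```python
# The overlap test at the heart of R-Tree search, on a flat set of
# rectangles (object names invented). A real R-Tree nests rectangles
# in a balanced tree so queries skip non-overlapping subtrees.

def overlaps(a, b):
    """Rectangles as (xmin, ymin, xmax, ymax)."""
    ax0, ay0, ax1, ay1 = a
    bx0, by0, bx1, by1 = b
    return ax0 <= bx1 and bx0 <= ax1 and ay0 <= by1 and by0 <= ay1

def window_query(rects, window):
    return [rid for rid, r in rects.items() if overlaps(r, window)]

vlsi = {"gate1": (0, 0, 2, 1), "wire7": (1, 1, 5, 2), "pad3": (8, 8, 9, 9)}
hits = window_query(vlsi, (0, 0, 3, 3))
```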

4.  The foundation for what would later become a major feature of object-relational databases: support for user-defined extensions to Ingres based on Abstract Data Types (ADTs) [Ong et al. 1984], inspired by requirements for potentially adding spatial data support to Ingres. (For other views on Mike’s object-relational contributions, see Chapter 3 by Phil Bernstein, Chapter 12 by Mike Olson, Chapter 16 by Joe Hellerstein, and Chapter 6 by David DeWitt.) Developed by a pair of master’s students [Fogg 1982, Ong 1982], the ADT-Ingres prototype—which never became part of the public university release of Ingres—allowed Ingres users to declare new types whose instances could then be stored as attribute values in relational attributes declared to be “of” those types. Each such ADT definition needed to specify the name of the ADT, its upper bound sizes in binary and serialized string form, and string-to-binary (input) and binary-to-string (output) function names and their implementation file. New operators could also be defined, in order to operate on ADT instances, and they could involve either known primitive types or (this or other) ADTs in their signatures. Based on lessons from that work, Mike himself developed a design for indexing on ADT values, and his paper on that topic [Stonebraker 1986b] eventually earned him a Test-of-Time award from the IEEE International Conference on Data Engineering. Again, for the historical record, the following sequence of Quel statements (see Figure 15.3) taken directly from [Ong et al. 1984] illustrates the nature of the ADT-Ingres support for ADTs using an example of adding a data type to handle complex number data.

Figure 15.3  ADT-Ingres support for ADTs: Example of adding a data type to handle complex number data. Source: Ong et al. [1984].
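
The machinery Figure 15.3 expresses in Quel can be mimicked in a few lines of Python to make the moving parts explicit: a type is registered by name together with a string-to-binary (input) function and a binary-to-string (output) function, and the system stores only the opaque binary form. The registry API below is invented for illustration, and the toy parser handles only positive components.

```python
# Invented registry mimicking the ADT mechanism of Figure 15.3: a type
# name plus input (string-to-binary) and output (binary-to-string)
# functions; the system stores only the opaque binary form.
# The toy parser handles only positive components, e.g. "1.5+2.0i".

import struct

ADT_REGISTRY = {}

def define_adt(name, to_binary, from_binary):
    ADT_REGISTRY[name] = (to_binary, from_binary)

def complex_in(s):                 # "1.5+2.0i" -> 16 opaque bytes
    re, im = s.rstrip("i").split("+")
    return struct.pack("dd", float(re), float(im))

def complex_out(b):                # 16 opaque bytes -> "1.5+2.0i"
    re, im = struct.unpack("dd", b)
    return f"{re}+{im}i"

define_adt("complex", complex_in, complex_out)

to_bin, from_bin = ADT_REGISTRY["complex"]
stored = to_bin("1.5+2.0i")        # what the column actually holds
```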

5.  Several Ingres-oriented takes on “AI and databases” or “expert database systems” (e.g., Stonebraker et al. [1982a], Stonebraker et al. [1983c], and Stonebraker [1985b]). This came in the form of several rule system designs—triggers and beyond—that Mike proposed as machinery that could be added to a relational DBMS for use in adding knowledge in addition to data for advanced applications. Mike actually spent quite a few years looking in part at rules in databases from different angles.

6.  A “Mike special” response to the database community’s desire to extend relational databases to support the storage and retrieval of complex objects. Instead of extending Ingres in the direction of supporting objects and identity—“pointer spaghetti!” in Mike’s opinion—he proposed adding a capability to “simply” add Quel queries to the list of things that one could store in an attribute of a relation in Ingres [Stonebraker 1984]. To refer to an object, one could write a query to fetch it, and could then store that query as a value of type Quel in an attribute of the referring object. (Scary stuff, in my view at the time: essentially “query spaghetti!”) Mike also proposed allowing such queries to be defined at the schema level but customized per tuple: the Quel query is specified once, then parameterized for a given tuple using values drawn from the tuple’s other attributes. (Better!)
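
A toy rendition of the "queries as values" idea, with the query written as a Python callable rather than Quel text (all names invented): the schema-level query is defined once and parameterized per tuple by the tuple's own attributes, Mike's "better" variant.

```python
# Toy "queries as values": a schema-level query, parameterized per
# tuple by the tuple's own attributes, stands in for a pointer.
# The query is a Python callable here rather than Quel text.

parts = {10: {"pname": "bolt"}, 11: {"pname": "nut"}}

# Defined once at the schema level; each order tuple supplies its own pno.
def fetch_part(tup):
    return parts[tup["pno"]]

orders = [{"oid": 1, "pno": 10}, {"oid": 2, "pno": 11}]
resolved = [{**o, "part": fetch_part(o)} for o in orders]
```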

Out of this last phase of “proof-of-concept efforts”—now informed by these experiences and inspired by their relative successes and failures—rose the Next Big Project for Mike: namely, Postgres [Stonebraker and Rowe 1986]. As I was living through the final phase of my thesis work, the “Postgres guys” started to arrive. Soon thereafter, this time with Mike working with Larry Rowe, another big Berkeley database system adventure (described in the next chapter!) got under way …

Looking back now, I realize that I was incredibly fortunate: An essentially random walk led me to Mike’s doorstep and his door was open when I got there. (That’ll teach you, Mike! Never, ever work with your door open …) The Ingres system was my educational on ramp to databases, and I got to be at Berkeley at a time when Mike and others around me were laying many of the foundations that would stand the test of time and be recognized for their significance now as part of the much-deserved presentation of the 2014 Turing Award to Mike.

Being in Ingres-land and being advised by Mike at that time has influenced my own career tremendously. I have since always sought to impart the same message about the database systems field in my teaching (database systems boring? not!!). I was also incurably infected with the “systems building bug” (due to double exposure to Mike and later to my Wisconsin colleague and mentor David DeWitt), which persists to this day.

1. It’s a little-known fact that Teradata, the big gorilla system in high-performance relational data warehousing, was initially based on Quel before converting to SQL.

16

Looking Back at Postgres

Joseph M. Hellerstein

Postgres was Michael Stonebraker’s most ambitious project—his grand effort to build a one-size-fits-all database system. A decade long, it generated more papers, Ph.D.s, professors, and companies than anything else he did. It also covered more technical ground than any other single system he built. Despite the risk inherent in taking on that scope, Postgres also became the most successful software artifact to come out of Stonebraker’s research groups, and his main contribution to open source. As of the time of writing, the open-source PostgreSQL system is the most popular, independent open-source database system in the world, and the fourth most popular database system in the world. Meanwhile, companies built from a Postgres base have generated a sum total of over $2.6 billion in acquisitions. By any measure, Stonebraker’s Postgres vision resulted in enormous and ongoing impact.

Context

Stonebraker had enormous success in his early career with the Ingres research project at Berkeley (see Chapter 15), and the subsequent startup he founded with Larry Rowe and Eugene Wong: Relational Technology, Inc. (RTI).

As RTI was developing in the early 1980s, Stonebraker began working on database support for data types beyond the traditional rows and columns of Codd’s original relational model. A motivating example at the time was to provide database support for CAD tools for the microelectronics industry. In a 1983 paper, Stonebraker and students Brad Rubenstein and Antonin Guttman explained how that industry needed support for “new data types such as polygons, rectangles, text strings, etc.,” “efficient spatial searching,” “complex integrity constraints,” and “design hierarchies and multiple representations” of the same physical constructions [Stonebraker 1983a]. Based on motivations such as these, the group started work on indexing (including Guttman’s influential R-trees for spatial indexing [Guttman 1984]), and on adding Abstract Data Types (ADTs) to a relational database system. ADTs were a popular new programming language construct at the time, pioneered by subsequent Turing Award winner Barbara Liskov and explored in database application programming by Stonebraker’s new collaborator, Larry Rowe. In a paper in SIGMOD Record in 1984 [Ong et al. 1984], Stonebraker and students James Ong and Dennis Fogg describe an exploration of this idea as an extension to Ingres called ADT-Ingres, which included many of the representational ideas that were explored more deeply—and with more system support—in Postgres.

Postgres: An Overview

As indicated by the name, Postgres was “Post-Ingres”: a system designed to take what Ingres could do and go beyond. The signature theme of Postgres was the introduction of what Stonebraker eventually called Object-Relational database features: support for object-oriented programming ideas within the data model and declarative query language of a database system. But Stonebraker also decided to pursue a number of other technical challenges in Postgres that were independent of object-oriented support, including active database rules, versioned data, tertiary storage, and parallelism.

Two papers were written on the design of Postgres: an early design in SIGMOD 1986 [Stonebraker and Rowe 1986] and a “mid-flight” design description in CACM 1991 [Stonebraker and Kemnitz 1991]. The Postgres research project ramped down in 1992 with the founding of Stonebraker’s Illustra startup, which involved Stonebraker, key Ph.D. student Wei Hong, and then-chief programmer Jeff Meredith (see Chapter 25). Below, the features mentioned in the 1986 paper are marked with an asterisk (*); those from the 1991 paper that were not in the 1986 paper are marked with a plus sign (+). Other goals listed below were tackled in the system and the research literature, but not in either design paper:

1.  Supporting ADTs in a Database System

(a)  Complex Objects (i.e., nested or non-first-normal form data)*

(b)  User-Defined Abstract Data Types and Functions*

(c)  Extensible Access Methods for New Data Types*

(d)  Optimizer Handling of Queries with Expensive User-Defined Functions

2.  Active Databases and Rules Systems (Triggers, Alerts)*

(a)  Rules implemented as query rewrites+

(b)  Rules implemented as record-level triggers+

3.  Log-centric Storage and Recovery

(a)  Reduced-complexity recovery code by treating the log as data,* using non-volatile memory for commit status+

(b)  No-overwrite storage and time travel queries+

4.  Support for querying data on new deep storage technologies, notably optical disks*

5.  Support for multiprocessors or custom processors*

6.  Support for a variety of language models

(a)  Minimal changes to the relational model and support for declarative queries*

(b)  Exposure of “fast path” access to internal APIs, bypassing the query language+

(c)  Multi-lingual support+

Many of these topics were addressed in Postgres well before they were studied or reinvented by others; in many cases, Postgres was too far ahead of its time and the ideas caught fire later, with a contemporary twist.

We briefly discuss each of these Postgres contributions, and connections to subsequent work in computing.

Supporting ADTs in a Database System

The signature goal of Postgres was to support new Object-Relational features: the extension of database technology to support a combination of the benefits of relational query processing and object-oriented programming. Over time the Object-Relational ideas pioneered in Postgres have become standard features in most modern database systems.

A. Complex Objects

It is quite common for data to be represented in the form of nested bundles or “objects.” A classic example is a purchase order, which has a nested set of products, quantities, and prices in the order. Relational modeling religion dictated that such data should be restructured and stored in an unnested format, using multiple flat entity tables (orders, products) with flat relationship tables (product_in_order) connecting them. The classic reason for this flattening is that it reduces duplication of data (a product being described redundantly in many purchase orders), which in turn avoids complexity or errors in updating all redundant copies. But in some cases, you want to store the nested representation, because it is natural for the application (say, a circuit layout engine in a CAD tool), and updates are rare. This data modeling debate is at least as old as the relational model.
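
The trade-off reads concretely as follows; the schemas are invented for illustration. The same purchase order appears once as a nested object and once flattened into three relations, where each product is described exactly once.

```python
# Invented schemas: the same purchase order, nested and flattened.

nested_order = {
    "order_id": 7,
    "lines": [
        {"product": "widget", "qty": 2, "price": 9.99},
        {"product": "gadget", "qty": 1, "price": 19.99},
    ],
}

# Flattened form: each product is described exactly once, and a flat
# relationship table connects orders to products.
orders = [{"order_id": 7}]
products = [{"product": "widget", "price": 9.99},
            {"product": "gadget", "price": 19.99}]
product_in_order = [{"order_id": 7, "product": "widget", "qty": 2},
                    {"order_id": 7, "product": "gadget", "qty": 1}]
```

A price change touches one row of `products` in the flat form, but every order that mentions the product in the nested form, which is why flattening wins when updates are common and nesting wins when they are rare.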

A key aspect of Postgres was to “have your cake and eat it too” from a data modeling perspective: Postgres retained tables as its “outermost” data type but allowed columns to have “complex” types including nested tuples or tables. One of its more esoteric implementations, first explored in the ADT-Ingres prototype, was to allow a table-typed column to be specified declaratively as a query definition: “Quel as a data type” [Stonebraker et al. 1984a].

The “post-relational” theme of supporting both declarative queries and nested data has recurred over the years—often as an outcome of arguments about which is better. At the time of Postgres in the 1980s and 1990s, some of the object-oriented database groups picked up the idea and pursued it to a standard language called OQL, which has since fallen from use.

Around the turn of the millennium, declarative queries over nested objects became a research obsession for a segment of the database community in the guise of XML databases; the resulting XQuery language (headed by Don Chamberlin of SQL fame) owes a debt to the complex object support in Postgres’ PostQuel language. XQuery had broad adoption and implementation in industry, but never caught on with users. The ideas are being revisited yet again today in query language designs for the JSON data model popular in browser-based applications. Like OQL, these languages are in many cases an afterthought in groups that originally rejected declarative queries in favor of developer-centric programming (the “NoSQL” movement), only to want to add queries back to the systems post hoc. In the meantime, as Postgres has grown over the years (and shifted syntax from PostQuel to versions of SQL that reflect many of these goals), it has incorporated support for nested data like XML and JSON into a general-purpose DBMS without requiring any significant rearchitecting. The battle swings back and forth, but the Postgres approach of extending the relational framework with extensions for nested data has shown time and again to be a natural end-state for all parties after the arguments subside.

B. User-defined Abstract Data Types and Functions

In addition to offering nested types, Postgres pioneered the idea of having opaque, extensible Abstract Data Types (ADTs), which are stored in the database but not interpreted by the core database system. In principle, this was always part of Codd’s relational model: integers and strings were traditional, but really any atomic data types with predicates can be captured in the relational model. The challenge was to provide that mathematical flexibility in software. To enable queries that interpret and manipulate these objects, an application programmer needs to be able to register User-Defined Functions (UDFs) for these types with the system and be able to invoke those UDFs in queries. User-Defined Aggregate (UDA) functions are also desirable to summarize collections of these objects in queries. Postgres was the pioneering database system supporting these features in a comprehensive way.
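
The flavor of this design can be sketched in a few lines of Python (a toy sketch; `TYPE_CATALOG`, `register_udf`, and friends are illustrative names, not Postgres internals): the engine treats ADT values as opaque and dispatches interpretation to registered UDFs, so a query can filter on a type the core system never understands.

```python
# A miniature sketch of the Postgres ADT/UDF idea: the engine stores values
# opaquely and consults a catalog of registered functions to interpret them.
# All names here are illustrative, not actual Postgres internals.

TYPE_CATALOG = {}  # type name -> dict of UDF name -> callable

def register_type(name):
    TYPE_CATALOG[name] = {}

def register_udf(type_name, udf_name, fn):
    TYPE_CATALOG[type_name][udf_name] = fn

def eval_udf(type_name, udf_name, value):
    # The core system never interprets `value`; it only dispatches.
    return TYPE_CATALOG[type_name][udf_name](value)

# Register a 2-D "point" ADT stored as an opaque (x, y) pair.
register_type("point")
register_udf("point", "magnitude", lambda p: (p[0] ** 2 + p[1] ** 2) ** 0.5)

# A query like SELECT * FROM t WHERE magnitude(loc) < 6 filters via dispatch:
table = [{"id": 1, "loc": (3, 4)}, {"id": 2, "loc": (6, 8)}]
result = [r for r in table if eval_udf("point", "magnitude", r["loc"]) < 6]
```

The point of the sketch is the division of labor: the catalog, not the core engine, knows how to evaluate the predicate.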

Why put this functionality into the DBMS, rather than the applications above? The classic answer was the significant performance benefit of “pushing code to data,” rather than “pulling data to code.” Postgres showed that this is quite natural within a relational framework: It involved modest changes to a relational metadata catalog, and mechanisms to invoke foreign code, but the query syntax, semantics, and system architecture all worked out simply and elegantly.

Postgres was a bit ahead of its time in exploring this feature. In particular, the security implications of uploading unsafe code to a server were not an active concern in the database research community at the time. This became problematic when the technology started to get noticed in industry. Stonebraker commercialized Postgres in his Illustra startup, which was acquired by Informix in large part for its ability to support extensible “DataBlades” (extension packages) including UDFs. Informix’s Postgres-based technology, combined with its strong parallel database offering, made Informix a significant threat to Oracle. Oracle invested heavily in negative marketing about the risks of Informix’s ability to run “unprotected” user-defined C code. Some trace the demise of Informix to this campaign, although Informix’s financial shenanigans (and subsequent federal indictment of its then-CEO) were certainly more problematic. Now, decades later, all the major database vendors support the execution of user-defined functions in one or more languages, using newer technologies to protect against server crashes or data corruption.

Meanwhile, the Big Data stacks of the 2000s—including the MapReduce phenomenon that gave Stonebraker and DeWitt such heartburn [DeWitt and Stonebraker 2008]—are a re-realization of the Postgres idea of user-defined code hosted in a query framework. MapReduce looks very much like a combination of software engineering ideas from Postgres and parallelism ideas from systems like Gamma and Teradata, with some minor innovation around mid-query restart for extreme-scalability workloads. Postgres-based start-ups Greenplum and Aster showed around 2007 that parallelizing Postgres could result in something of much higher function and practicality than MapReduce for most customers, but the market still wasn’t ready for any of this technology in 2008. By now, in 2018, nearly every Big Data stack primarily serves a workload of parallel SQL with UDFs—very much like the design Stonebraker and team pioneered in Postgres.

C. Extensible Access Methods for New Data Types

Relational databases evolved around the same time as B-trees in the early 1970s, and B-trees helped fuel Codd’s dream of “physical data independence”: B-tree indexes provide a level of indirection that adaptively reorganizes physical storage without requiring applications to change. The main limitation of B-trees and related structures was that they only support equality lookups and one-dimensional range queries. What if you have two-dimensional range queries of the kind typical in mapping and CAD applications? This problem was au courant at the time of Postgres, and the R-tree developed by Antonin Guttman in Stonebraker’s group was one of the most successful new indexes developed to solve this problem in practice. Still, the invention of an index structure does not solve the end-to-end systems problem of DBMS support for multi-dimensional range queries. Many questions arise. Can you add an access method like R-trees to your DBMS easily? Can you teach your optimizer that said access method will be useful for certain queries? Can you get concurrency and recovery correct?

This was a very ambitious aspect of the Postgres agenda: a software architecture problem affecting most of a database engine, from the optimizer to the storage layer and the logging and recovery system. R-trees became a powerful driver and the main example of the elegant extensibility of Postgres’ access method layer and its integration into the query optimizer. Postgres demonstrated—in an opaque ADT style—how to register an abstractly described access method (the R-tree, in this case), and how a query optimizer could recognize an abstract selection predicate (a range selection in this case) and match it to that abstractly described access method. Questions of concurrency control were less of a focus in the original effort: The lack of a unidimensional ordering on keys made B-tree-style locking inapplicable.1
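The registration-and-matching idea can be caricatured as follows (a Python toy; real Postgres access methods are registered via catalog entries and a C-level interface, and the names here are purely illustrative): the optimizer recognizes an abstract predicate class and routes it to whichever registered access method claims it, falling back to a sequential scan otherwise.

```python
# Sketch of extensible access-method registration. The optimizer matches an
# abstract predicate class ("2d-range" here) against registered access
# methods; unmatched predicates fall back to a sequential scan.
# Illustrative only, not Postgres's actual catalog machinery.

ACCESS_METHODS = {}  # predicate class -> search function

def register_access_method(predicate_class, search_fn):
    ACCESS_METHODS[predicate_class] = search_fn

def run_scan(rows, predicate_class, arg, fallback_pred=None):
    am = ACCESS_METHODS.get(predicate_class)
    if am is not None:
        return am(rows, arg)  # "index scan" through the registered method
    return [r for r in rows if fallback_pred(r)]  # sequential scan

# An R-tree-like method answering 2-D box containment (a naive stand-in).
def box_search(rows, box):
    (x0, y0), (x1, y1) = box
    return [r for r in rows if x0 <= r["x"] <= x1 and y0 <= r["y"] <= y1]

register_access_method("2d-range", box_search)

points = [{"x": 1, "y": 1}, {"x": 5, "y": 5}, {"x": 9, "y": 2}]
hits = run_scan(points, "2d-range", ((0, 0), (6, 6)))
```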

PostgreSQL today leverages both the original software architecture of extensible access methods (it has B-tree, GiST, SP-GiST, and Gin indexes) and the extensibility and high concurrency of the Generalized Search Tree (GiST) interface. GiST indexes power the popular PostgreSQL-based PostGIS geographic information system; Gin indexes power PostgreSQL’s internal text indexing support.

D. Optimizer Handling of Queries with Expensive UDFs

In traditional query optimization, the challenge was generally to minimize the amount of tuple-flow (and hence I/O) you generate in processing a query. This meant that operators that filter tuples (selections) are good to do early in the query plan, while operators that can generate new tuples (joins) should be done later. As a result, query optimizers would “push” selections below joins and order them arbitrarily, focusing instead on cleverly optimizing joins and disk accesses. UDFs changed this: if you have expensive UDFs in your selections, the order of executing UDFs can be critical to optimizing performance. Moreover, if a UDF in a selection is really time consuming, it’s possible that it should happen after joins (i.e., selection “pullup”). Handling this optimally complicates the optimizer’s plan space.

I took on this problem as my first challenge in graduate school and it ended up being the subject of both my M.S. with Stonebraker at Berkeley and my Ph.D. at Wisconsin under Jeff Naughton, with ongoing input from Stonebraker. Postgres was the first DBMS to capture the costs and selectivities of UDFs in the database catalog. We approached the optimization problem by coming up with an optimal ordering of selections, and then an optimal interleaving of the selections along the branches of each join tree considered during plan search. This allowed for an optimizer that maintained the textbook dynamic programming architecture of System R, with a small additional sorting cost to get the expensive selections ordered properly.2
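The flavor of the ordering result can be conveyed with a small sketch. Assuming independent predicates, each with a per-tuple cost c and selectivity s recorded in the catalog, applying them in ascending order of rank (s − 1)/c minimizes the expected cost per input tuple; the brute-force comparison below is an illustration of that claim, not the thesis algorithm.

```python
from itertools import permutations

# Expected evaluation cost per input tuple of applying (cost, selectivity)
# predicates in the given order: each predicate is paid only by the fraction
# of tuples that survived all earlier predicates.
def expected_cost(preds):
    total, surviving = 0.0, 1.0
    for cost, selectivity in preds:
        total += surviving * cost
        surviving *= selectivity
    return total

# The classic ordering: ascending "rank" = (selectivity - 1) / cost.
def rank_order(preds):
    return sorted(preds, key=lambda cs: (cs[1] - 1.0) / cs[0])

# Three hypothetical predicates: (per-tuple cost, selectivity).
preds = [(10.0, 0.9), (1.0, 0.5), (100.0, 0.1)]
brute_force_best = min(permutations(preds), key=expected_cost)
```

Here the cheap, fairly selective predicate runs first, but the very selective predicate is so expensive that it still runs last; rank ordering captures exactly this trade-off.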

The expensive function optimization feature was disabled in the PostgreSQL source trees early on, in large part because there weren’t compelling use cases at that time for expensive user-defined functions.3 The examples we used revolved around image processing and are finally becoming mainstream data processing tasks in 2018. Of course, today in the era of Big Data and machine learning workloads, expensive functions have become quite common, and I expect this problem to return to the fore. Once again, Postgres was well ahead of its time.

Active Databases and Rule Systems

The Postgres project began at the tail end of the AI community’s interest in rule-based programming as a way to represent knowledge in “expert systems.” That line of thinking was not successful; many say it led to the much discussed “AI winter” that persisted through the 1990s.

However, rule programming persisted in the database community in two forms. The first was theoretical work around declarative logic programming using Datalog. This was a bugbear of Stonebraker’s; he really seemed to hate the topic and famously criticized it in multiple “community” reports over the years.4 The second database rules agenda was pragmatic work on what was eventually dubbed Active Databases and Database Triggers, which evolved to be a standard feature of relational databases. Stonebraker characteristically voted with his feet to work on the more pragmatic variant.

Stonebraker’s work on database rules began with Eric Hanson’s Ph.D., which initially targeted Ingres but quickly transitioned to the new Postgres project. It expanded to the Ph.D. work of Spyros Potamianos on PRS2: Postgres Rules System 2. A theme in both implementations was the potential to implement rules in two different ways. One option was to treat rules as query rewrites, reminiscent of the work on rewriting views that Stonebraker pioneered in Ingres. In this scenario, a rule logic of “on condition then action” is recast as “on query then rewrite to a modified query and execute it instead.” For example, a query like “append a new row to Mike’s list of awards” might be rewritten as “raise Mike’s salary by 10%.” The other option was to implement a more physical “on condition then action,” checking conditions at a row level by using locks inside the database. When such locks were encountered, the result was not to wait (as in traditional concurrency control), but to execute the associated action.5
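The two strategies can be caricatured on a toy table using the award/salary example (all names, data, and logic here are invented for illustration; PRS2's actual lock mechanism lived inside the storage layer):

```python
# Toy illustration of the two rule implementations debated in Postgres:
# (1) rewrite the statement stream before execution, or (2) attach a per-row
# rule "lock" whose action fires when the row is touched, rather than
# blocking the writer as an ordinary lock would. Purely illustrative.

def apply_award_rewrite(stmts):
    """Rewrite scheme: expand 'award Mike' into award + 10% raise."""
    out = []
    for s in stmts:
        out.append(s)
        if s == ("award", "Mike"):
            out.append(("raise", "Mike", 0.10))
    return out

def execute(table, stmts, row_rules=None):
    for s in stmts:
        if s[0] == "award":
            table[s[1]]["awards"] += 1
            # Lock scheme: encountering a rule lock runs its action in place.
            for action in (row_rules or {}).get(s[1], []):
                action(table[s[1]])
        elif s[0] == "raise":
            table[s[1]]["salary"] *= (1 + s[2])

# Both schemes yield the same outcome on this example.
t1 = {"Mike": {"awards": 0, "salary": 100.0}}
execute(t1, apply_award_rewrite([("award", "Mike")]))

t2 = {"Mike": {"awards": 0, "salary": 100.0}}
rules = {"Mike": [lambda row: row.update(salary=row["salary"] * 1.1)]}
execute(t2, [("award", "Mike")], row_rules=rules)
```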

In the end, neither the query rewriting scheme nor the row-level locking scheme was declared a “winner” for implementing rules in Postgres—both were kept in the released system. Eventually all of the rules code was scrapped and rewritten in PostgreSQL, but the current source still retains both the notions of per-statement and per-row triggers.

The Postgres rules systems were very influential in their day and went “head to head” with research from IBM’s Starburst project and MCC’s HiPac project. Today, “triggers” are part of the SQL standard and implemented in many of the major database engines. They are used somewhat sparingly, however. One problem is that this body of work never overcame the issues that led to AI winter: The interactions within a pile of rules can become untenably confusing as the rule set grows even modestly. In addition, triggers still tend to be relatively time consuming in practice, so database installations that have to run fast tend to avoid the use of triggers. But there has been a cottage industry in related areas like materialized view maintenance, Complex Event Processing, and stream queries, all of which are in some way extensions of ideas explored in the Postgres rules systems.

Log-centric Storage and Recovery

Stonebraker described his design for the Postgres storage system this way:

“When considering the POSTGRES storage system, we were guided by a missionary zeal to do something different. All current commercial systems use a storage manager with a write-ahead log (WAL), and we felt that this technology was well understood. Moreover, the original Ingres prototype from the 1970s used a similar storage manager, and we had no desire to do another implementation.” [Stonebraker and Kemnitz 1991]

While this is cast as pure intellectual restlessness, there were technical motivations for the work as well. Over the years, Stonebraker repeatedly expressed distaste for the complex write-ahead logging schemes pioneered at IBM and Tandem for database recovery. One of his core objections was based on a software engineering intuition that nobody should rely upon something that complicated—especially for functionality that would only be exercised in rare, critical scenarios after a crash.

The Postgres storage engine unified the notion of primary storage and historical logging into a single, simple disk-based representation. At base, the idea was to keep each record in the database in a linked list of versions stamped with transaction IDs—in some sense, this is “the log as data” or “the data as a log,” depending on your point of view. The only additional metadata required is a list of committed transaction IDs and wall-clock times. This approach simplifies recovery enormously since there’s no “translating” from a log representation back to a primary representation. It also enables “time travel” queries: You can run queries “as of” some wall-clock time and access the versions of the data that were committed at that time. The original design of the Postgres storage system—which reads very much as if Stonebraker wrote it in one creative session of brainstorming—contemplated a number of efficiency problems and optimizations to this basic scheme, along with some wet-finger analyses of how performance might play out [Stonebraker 1987]. The resulting implementation in Postgres was somewhat simpler.
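The essence of the scheme can be sketched as follows (a toy Python model; `VersionedTable` and its methods are invented for illustration and do not reflect PostgreSQL's actual tuple format): every write appends a version stamped with its transaction ID, the only extra metadata is the commit-time map, and a time-travel read simply walks the chain for the newest version committed at or before the requested time.

```python
# Toy model of the "no-overwrite" Postgres storage idea: each logical record
# is a chain of versions stamped with transaction IDs (xids), plus a map of
# committed xids to wall-clock commit times. Illustrative names throughout.

class VersionedTable:
    def __init__(self):
        self.chains = {}       # key -> list of (xid, value), newest last
        self.commit_time = {}  # xid -> wall-clock commit time

    def write(self, xid, key, value):
        # Never overwrite: just append a new version to the chain.
        self.chains.setdefault(key, []).append((xid, value))

    def commit(self, xid, now):
        self.commit_time[xid] = now

    def read_as_of(self, key, as_of):
        # Time travel: newest version whose xid committed at or before as_of.
        for xid, value in reversed(self.chains.get(key, [])):
            t = self.commit_time.get(xid)
            if t is not None and t <= as_of:
                return value
        return None

t = VersionedTable()
t.write(xid=1, key="mike", value={"salary": 100})
t.commit(1, now=10)
t.write(xid=2, key="mike", value={"salary": 110})
t.commit(2, now=20)
```

Note that crash recovery in this model is trivial: uncommitted versions are simply those whose xids never appear in the commit-time map, so there is no log to replay.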

Stonebraker’s idea of “radical simplicity” for transactional storage was deeply countercultural at the time when the database vendors were differentiating themselves by investing heavily in the machinery of high-performance transaction processing. Benchmark winners at the time achieved high performance and recoverability via highly optimized, complex write-ahead logging systems. Once they had write-ahead logs working well, the vendors also began to innovate on follow-on ideas such as transactional replication based on log shipping, which would be difficult in the Postgres scheme. In the end, the Postgres storage system never excelled on performance; versioning and time travel were removed from PostgreSQL over time and replaced by write-ahead logging.6 But the time-travel functionality was interesting and remained unique. Moreover, Stonebraker’s ethos regarding simple software engineering for recovery has echoes today both in the context of NoSQL systems (which choose replication rather than write-ahead logging) and main-memory databases (which often use multi-versioning and compressed commit logs). The idea of versioned relational databases and time-travel queries is still relegated to esoterica today, popping up in occasional research prototypes and minor open-source projects. It is an idea that is ripe for a comeback in our era of cheap storage and continuously streaming data.

Queries over New Deep Storage Technologies

In the middle of the Postgres project, Stonebraker signed on as a co-principal investigator on a large grant for digital earth science called Project Sequoia. Part of the grant proposal was to handle unprecedented volumes of digital satellite imagery requiring up to 100 terabytes of storage, far more data than could be reasonably stored on magnetic disks at the time. The center of the proposed solution was to explore the idea of a DBMS (namely Postgres) facilitating access to near-line “tertiary” storage provided by robotic “jukeboxes” for managing libraries of optical disks or tapes.

A couple of different research efforts came out of this. One was the Inversion file system: an effort to provide a UNIX filesystem abstraction above an RDBMS. In an overview paper for Sequoia, Stonebraker described this in his usual cavalier style as “a straightforward exercise” [Stonebraker 1995]. In practice, this kept Stonebraker’s student (and subsequent Cloudera founder) Mike Olson busy for a couple of years, and the final result was not exactly straightforward [Olson 1993], nor did it survive in practice.7

The other main research thrust on this front was the incorporation of tertiary storage into a more typical relational database stack, which was the subject of Sunita Sarawagi’s Ph.D. thesis. The main theme was to change the scale at which you think about managing space (i.e., data in storage and the memory hierarchy) and time (coordinating query and cache scheduling to minimize undesirable I/Os). One of the key issues in that work was to store and retrieve large multidimensional arrays in tertiary storage—echoing work in multidimensional indexing, the basic ideas included breaking up the array into chunks and storing chunks together that are fetched together—including replicating chunks to enable multiple physical “neighbors” for a given chunk of data. A second issue was to think about how disk becomes a cache for tertiary storage. Finally, query optimization and scheduling had to take into account the long load times of tertiary storage and the importance of “hits” in the disk cache—this affects both the plan chosen by a query optimizer, and the time at which that plan is scheduled for execution.

Tape and optical disk robots are not widely used at present. But the issues of tertiary storage are very prevalent in the cloud, which has deep storage hierarchies in 2018: from attached solid-state disks to reliable disk-like storage services (e.g., AWS EBS) to archival storage (e.g., AWS S3) to deep storage (e.g., AWS Glacier). It is still the case today that these storage tiers are relatively detached, and there is little database support for reasoning about storage across these tiers. I would not be surprised if the issues explored on this front in Postgres are revisited in the near term.

Support for Multiprocessors: XPRS

Stonebraker never architected a large parallel database system, but he led many of the motivating discussions in the field. His “Case for Shared Nothing” paper [Stonebraker 1986d] documented the coarse-grained architectural choices in the area; it popularized the terminology used by the industry and threw support behind shared-nothing architectures like those of Gamma and Teradata, which were rediscovered by the Big Data crowd in the 2000s.

Ironically, Stonebraker’s most substantive contribution to the area of parallel databases was a “shared memory” architecture called XPRS, which stood for eXtended Postgres on RAID and Sprite. XPRS was the “Justice League” of Berkeley systems in the early 1990s: a brief combination of Stonebraker’s Postgres system, John Ousterhout’s Sprite distributed OS, and Dave Patterson’s and Randy Katz’s RAID storage architectures. Like many multi-faculty efforts, the execution of XPRS was actually determined by the grad students who worked on it. The primary contributor ended up being Wei Hong, who wrote his Ph.D. thesis on parallel query optimization in XPRS. Hence, the main contribution of XPRS to the literature and industry was parallel query optimization, with no real consideration of issues involving RAID or Sprite.8

In principle, parallelism “blows up” the plan space for a query optimizer by making it multiply the traditional choices made during query optimization (data access, join algorithms, join orders) against all possible ways of parallelizing each choice. The basic idea of what Stonebraker called “The Wei Hong Optimizer” was to cut the problem in two: Run a traditional single-node query optimizer in the style of System R, and then “parallelize” the resulting single-node query plan by scheduling the degree of parallelism and placement of each operator based on data layouts and system configuration. This approach is heuristic, but it makes parallelism an additive cost to traditional query optimization, rather than a multiplicative cost.
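A toy sketch of the two-phase idea follows (invented costs and a made-up per-worker startup penalty; the real optimizer reasoned about data layouts and system configuration). The point is structural: phase 1 searches the traditional plan space alone, and phase 2 parallelizes the winner operator by operator, so the two searches add rather than multiply.

```python
# Two-phase "Wei Hong" optimization in caricature. Phase 1: pick the best
# single-node plan by total operator cost. Phase 2: independently choose a
# degree of parallelism per operator, trading work-splitting against a
# (hypothetical) per-worker startup penalty. Illustrative numbers only.

def best_single_node_plan(plans):
    return min(plans, key=lambda p: sum(op["cost"] for op in p))

def parallelize(plan, degrees, startup_cost=1.0):
    out = []
    for op in plan:
        # Cost model: work divides across d workers, each worker costs
        # startup_cost to launch.
        d = min(degrees, key=lambda k: op["cost"] / k + startup_cost * k)
        out.append({**op, "degree": d})
    return out

plans = [
    [{"name": "scan", "cost": 100.0}, {"name": "join", "cost": 400.0}],
    [{"name": "scan", "cost": 100.0}, {"name": "join", "cost": 900.0}],
]
parallel_plan = parallelize(best_single_node_plan(plans),
                            degrees=[1, 2, 4, 8, 16])
```

Under this toy cost model, the cheaper join order wins phase 1, and phase 2 then gives the expensive join a higher degree of parallelism than the scan.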

Although “The Wei Hong Optimizer” was designed in the context of Postgres, it became the standard approach for many of the parallel query optimizers in industry.

Support for a Variety of Language Models

One of Stonebraker’s recurring interests since the days of Ingres was the programmer API to a database system. In his Readings in Database Systems series, he frequently included work like Carlo Zaniolo’s GEM language as important topics for database system aficionados to understand. This interest in language undoubtedly led him to partner up with Larry Rowe on Postgres, which in turn deeply influenced the design of the Postgres data model and its Object-Relational approach. Their work focused largely on data-centric applications they saw in the commercial realm, including both business processing and emerging applications like computer-aided design and manufacturing (CAD/CAM) and geographic information systems (GIS).

One issue that was forced upon Stonebraker at the time was the idea of “hiding” the boundary between programming language constructs and database storage. Various competing research projects and companies exploring Object-Oriented Databases (OODBs) were targeting the so-called “impedance mismatch” between imperative object-oriented programming languages like Smalltalk, C++, and Java, and the declarative relational model. The OODB idea was to make programming language objects be optionally marked “persistent,” and handled automatically by an embedded DBMS. Postgres supported storing nested objects and ADTs, but its relational-style declarative query interface meant that each round trip to the database was unnatural for the programmer (requiring a shift to declarative queries) and expensive to execute (requiring query parsing and optimization). To compete with the OODB vendors, Postgres exposed a so-called “Fast Path” interface: basically, a C/C++ API to the storage internals of the database. This enabled Postgres to be moderately performant in academic OODB benchmarks, but never really addressed the challenge of allowing programmers in multiple languages to avoid the impedance mismatch problem. Instead, Stonebraker branded the Postgres model as “Object-Relational” and simply sidestepped the OODB workloads as a “zero-billion-dollar” market. Today, essentially all commercial relational database systems are “Object-Relational” database systems.

This proved to be a sensible decision. Today, none of the OODB products exist in their envisioned form, and the idea of “persistent objects” in programming languages has largely been discarded. By contrast, there is widespread usage of object-relational mapping layers (fueled by early efforts like Java Hibernate and Ruby on Rails) that allow declarative databases to be tucked under nearly any imperative object-oriented programming language as a library, in a relatively seamless way. This application-level approach is different than both OODBs and Stonebraker’s definition of Object-Relational DBs. In addition, lightweight persistent key-value stores have succeeded as well, in both non-transactional and transactional forms. These were pioneered by Stonebraker’s Ph.D. student Margo Seltzer, who wrote BerkeleyDB as part of her Ph.D. thesis at the same time as the Postgres group; this work presaged the rise of distributed “NoSQL” key-value stores like Dynamo, MongoDB, and Cassandra.

Software Impact

Open Source

Postgres was always an open-source project with steady releases, but in its first many years it was targeted at usage in research, not in production.

As the Postgres research project was winding down, two students in Stonebraker’s group—Andrew Yu and Jolly Chen—modified the system’s parser to accept an extensible variant of SQL rather than the original PostQuel language. The first Postgres release supporting SQL was Postgres95; the next was dubbed PostgreSQL.

A set of open-source developers became interested in PostgreSQL and “adopted” it even as the rest of the Berkeley team was moving on to other interests. Over time, the core developers for PostgreSQL have remained fairly stable, and the open-source project has matured enormously. Early efforts focused on code stability and user-facing features, but over time the open-source community made significant modifications and improvements to the core of the system as well, from the optimizer to the access methods and the core transaction and storage system. Since the mid-1990s, very few of the PostgreSQL internals came out of the academic group at Berkeley—the last contribution may have been my GiST implementation in the latter half of the 1990s—but even that was rewritten and cleaned up substantially by open-source volunteers (from Russia, in that case). The open source community around PostgreSQL deserves enormous credit for running a disciplined process that has soldiered on over decades to produce a remarkably high-impact and long-running project.

While many things have changed in 25 years, the basic architecture of PostgreSQL remains quite similar to the university releases of Postgres in the early 1990s, and developers familiar with the current PostgreSQL source code would have little trouble wandering through the Postgres 3.1 source code (c. 1991). Everything from source code directory structures to process structures to data structures remains remarkably similar. The code from the Berkeley Postgres team had excellent bones.

PostgreSQL today is without question the most high-function open-source DBMS, supporting features that are often missing from commercial products. It is also (according to one influential rankings site) the most popular widely used independent open-source database in the world,9 and its impact continues to grow: In 2017 it was the fastest-growing database system in the world in popularity.10 PostgreSQL is used across a wide variety of industries and applications, which is perhaps not surprising given its ambition of broad functionality; the PostgreSQL website catalogs some of the uses at http://www.postgresql.org/about/users/. (Last accessed January 22, 2018.)

Heroku is a cloud SaaS provider that is now part of Salesforce. Postgres was adopted by Heroku in 2010 as the default database for its platform. Heroku chose Postgres because of its operational reliability. With Heroku’s support, more major application frameworks such as Ruby on Rails and Python for Django began to recommend Postgres as their default database.

PostgreSQL today supports an extension framework that makes it easy to add additional functionality to the system via UDFs and related modifications. There is now an ecosystem of PostgreSQL extensions—akin to the Illustra vision of Data-Blades, but in open source. Some of the more interesting extensions include the Apache MADlib library for machine learning in SQL, and the Citus library for parallel query execution.
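The pattern this ecosystem relies on—one registration call that makes new logic visible to every SQL query—can be illustrated by analogy with Python's standard-library sqlite3 module, whose create_function hook is a miniature cousin of PostgreSQL's CREATE FUNCTION. This is a sketch by analogy, not PostgreSQL's API; the similarity UDF and the docs table are invented for the example.

```python
import sqlite3

def text_similarity(a: str, b: str) -> float:
    """Toy UDF (hypothetical): Jaccard similarity over character sets."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb) if (sa | sb) else 1.0

conn = sqlite3.connect(":memory:")
# One registration call exposes the function to all subsequent SQL,
# analogous in spirit to CREATE FUNCTION in an extensible DBMS.
conn.create_function("similarity", 2, text_similarity)

conn.execute("CREATE TABLE docs (body TEXT)")
conn.executemany("INSERT INTO docs VALUES (?)", [("postgres",), ("ingres",)])

# The UDF is now usable anywhere a built-in function would be.
rows = conn.execute(
    "SELECT body FROM docs WHERE similarity(body, 'postgres') > 0.9"
).fetchall()
# rows == [("postgres",)]
```

Real PostgreSQL extensions go much further—packaging types, operators, and even access methods—but the registration-based pattern is the same.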

One of the most interesting open-source applications built over Postgres is the PostGIS Geographic Information System, which takes advantage of many of the features in Postgres that originally inspired Stonebraker to start the project.

Commercial Adaptations

PostgreSQL has long been an attractive starting point for building commercial database systems, given its permissive open-source license, its robust codebase, its flexibility, and its breadth of functionality. Summing the acquisition prices listed below, Postgres has led to over $2.6 billion in acquisitions.11 Many of the commercial efforts that built on PostgreSQL have addressed what is probably its key limitation: the ability to scale out to a parallel, shared-nothing architecture.12

1. Illustra was Stonebraker’s second major start-up company, founded in 1992, seeking to commercialize Postgres as RTI had commercialized Ingres.13 The founding team included some of the core Postgres team, including recent Ph.D. alumnus Wei Hong and then-chief programmer Jeff Meredith, along with Ingres alumni Paula Hawthorn and Michael Ubell. Postgres M.S. student Mike Olson joined shortly after the founding, and I worked on the Illustra handling of optimizing expensive functions as part of my Ph.D. work. There were three main efforts in Illustra: to extend SQL92 to support user-defined types and functions as in PostQuel, to make the Postgres code base robust enough for commercial use, and to foster the market for extensible database servers via examples of “DataBlades,” domain-specific plug-in components of data types and functions (see Chapter 25). Illustra was acquired by Informix in 1996 for an estimated $400M,14 and its DataBlade architecture was integrated into a more mature Informix query processing codebase as Informix Universal Server.

2.  Netezza was a startup founded in 1999 that forked the PostgreSQL codebase to build a high-performance parallel query processing engine on custom field-programmable-gate-array-based hardware. Netezza was quite successful as an independent company and had its IPO in 2007. It was eventually acquired by IBM in a deal valued at $1.7B.15

3.  Greenplum was the first effort to offer a shared-nothing parallel, scale-out version of PostgreSQL. Founded in 2003, Greenplum forked from the public PostgreSQL distribution, but maintained the APIs of PostgreSQL to a large degree, including the APIs for user-defined functions. In addition to parallelization, Greenplum extended PostgreSQL with an alternative high-performance compressed columnar storage engine and a parallelized rule-driven query optimizer called Orca. Greenplum was acquired by EMC in 2010 for an estimated $300M; in 2012, EMC consolidated Greenplum into its subsidiary, Pivotal. In 2015, Pivotal chose to release Greenplum and Orca back into open source. One of the efforts at Greenplum that leveraged its Postgres API was the MADlib library for machine learning in SQL; MADlib runs single-threaded in PostgreSQL and in parallel over Greenplum. MADlib lives on today as an Apache project. Another interesting open-source project based on Greenplum is Apache HAWQ, a Pivotal design that runs the “top half” of Greenplum (i.e., the parallelized PostgreSQL query processor and extensibility APIs) in a decoupled fashion over Big Data stores such as the Hadoop File System.

4.  EnterpriseDB was founded in 2004 as an open-source-based business, selling PostgreSQL in both a vanilla and enhanced edition with related services for enterprise customers. A key feature of the enhanced EnterpriseDB Advanced Server is a set of database compatibility features with Oracle to allow application migration off of Oracle.

5.  Aster Data was founded in 2005 by two Stanford students to build a parallel engine for analytics. Its core single-node engine was based on PostgreSQL. Aster focused on queries for graphs and on analytics packages based on UDFs that could be programmed with either SQL or MapReduce interfaces. Aster Data was acquired by Teradata in 2011 for $263M.16 While Teradata never integrated Aster into its core parallel database engine, it still maintains Aster as a standalone product for use cases outside the core of Teradata’s warehousing market.

6.  ParAccel was founded in 2006, selling a shared-nothing parallel version of PostgreSQL with column-oriented, shared-nothing storage. ParAccel enhanced the Postgres optimizer with new heuristics for queries with many joins. In 2011, Amazon invested in ParAccel, and in 2012 announced AWS Redshift, a hosted data warehouse as a service in the public cloud based on ParAccel technology. In 2013, ParAccel was acquired by Actian (which also had acquired Ingres) for an undisclosed amount—meaning it was not a material expense for Actian. Meanwhile, AWS Redshift has been an enormous success for Amazon—for many years it was the fastest-growing service on AWS, and many believe it is poised to put long-time data warehousing products like Teradata and Oracle Exadata out of business. In this sense, Postgres may achieve its ultimate dominance in the cloud.

7.  CitusDB was founded in 2010 to offer a shared-nothing parallel implementation of PostgreSQL. While it started as a fork of PostgreSQL, as of 2016 CitusDB is implemented via public PostgreSQL extension APIs and can be installed into a vanilla PostgreSQL installation. Also, as of 2016, the CitusDB extensions are available in open source.

Lessons

You can draw a host of lessons from the success of Postgres, a number of them defiant of conventional wisdom.

The highest-order lesson I draw comes from the fact that Postgres defied Fred Brooks’ “Second System Effect.” Brooks argued that designers often follow up on a successful first system with a second system that fails due to being overburdened with features and ideas. Postgres was Stonebraker’s second system, and it was certainly chock full of features and ideas. Yet the system succeeded in prototyping many of the ideas while delivering a software infrastructure that carried a number of the ideas to a successful conclusion. This was not an accident—at base, Postgres was designed for extensibility, and that design was sound. With extensibility as an architectural core, it is possible to be creative and stop worrying so much about discipline: You can try many extensions and let the strong succeed. Done well, the “second system” is not doomed; it benefits from the confidence, pet projects, and ambitions developed during the first system. This is an early architectural lesson from the more “server-oriented” database school of software engineering, which defies conventional wisdom from the “component oriented” operating systems school of software engineering.

Another lesson is that a broad focus—“one size fits many”—can be a winning approach for both research and practice. To coin some names, “MIT Stonebraker” made a lot of noise in the database world in the early 2000s that “one size doesn’t fit all.” Under this banner he launched a flotilla of influential projects and startups, but none took on the scope of Postgres. It seems that “Berkeley Stonebraker” defies the later wisdom of “MIT Stonebraker,” and I have no issue with that.17 Of course there’s wisdom in the “one size doesn’t fit all” motto (it’s always possible to find modest markets for custom designs!), but the success of “Berkeley Stonebraker’s” signature system—well beyond its original intents—demonstrates that a broad majority of database problems can be solved well with a good general-purpose architecture. Moreover, the design of that architecture is a technical challenge and accomplishment in its own right. In the end—as in most science and engineering debates—there isn’t only one good way to do things. Both Stonebrakers have lessons to teach us. But at the base, I’m still a fan of the broader agenda that “Berkeley Stonebraker” embraced.

A final lesson I take from Postgres is the unpredictable potential that can come from open-sourcing your research. In his Turing talk, Stonebraker speaks about the “serendipity” of PostgreSQL succeeding in open source, largely via people outside Stonebraker’s own sphere. It’s a wonderfully modest quote:

[A] pick-up team of volunteers, none of whom have anything to do with me or Berkeley, have been shepherding that open-source system ever since 1995. The system that you get off the web for Postgres comes from this pick-up team. It is open source at its best and I want to just mention that I have nothing to do with that and that collection of folks we all owe a huge debt of gratitude to.18

I’m sure all of us who have written open source would love for that kind of “serendipity” to come our way. But it’s not all serendipity—the roots of that good luck were undoubtedly in the ambition, breadth, and vision that Stonebraker had for the project, and the team he mentored to build the Postgres prototype. If there’s a lesson there, it might be to “do something important and set it free.” It seems to me (to use a Stonebrakerism) that you can’t skip either part of that lesson.

Acknowledgments

I’m indebted to my old Postgres buddies Wei Hong, Jeff Meredith, and Mike Olson for their remembrances and input, and to Craig Kerstiens for his input on modern-day PostgreSQL.

1. The Postgres challenge of extensible access methods inspired one of my first research projects at the end of graduate school: the Generalized Search Trees (GiST) [Hellerstein et al. 1995] and subsequent notion of Indexability theory [Hellerstein et al. 2002]. I implemented GiST in Postgres during a postdoc semester, which made it even easier to add new indexing logic in Postgres. Marcel Kornacker’s thesis at Berkeley solved the difficult concurrency and recovery problems raised by extensible indexing in GiST in a templated way [Kornacker et al. 1997].

2. When I started grad school, this was one of three topics that Stonebraker wrote on the board in his office as options for me to think about for a Ph.D. topic. I think the second was function indexing, but I cannot remember the third.

3. Ironically, my code from grad school was fully deleted from the PostgreSQL source tree by a young open-source hacker named Neil Conway, who some years later started a Ph.D. with me at UC Berkeley and is now one of Stonebraker’s Ph.D. grandchildren.

4. Datalog survived as a mathematical foundation for declarative languages and has found application over time in multiple areas of computing including software-defined networks and compilers. Datalog is declarative querying “on steroids” as a fully expressive programming model. I was eventually drawn into it as a natural design choice and have pursued it in a variety of applied settings outside of traditional database systems.

5. The code for row-level rules in PRS2 was notoriously tricky. A bit of searching in the Berkeley Postgres archives unearthed the following source code comment—probably from Spyros Potamianos—in Postgres version 3.1, circa 1991:

* DESCRIPTION:

* Take a deeeeeeep breath & read. If you can avoid hacking the code

* below (i.e. if you have not been "volunteered" by the boss to do this

* dirty job) avoid it at all costs. Try to do something less dangerous

* for your (mental) health. Go home and watch horror movies on TV.

* Read some Lovecraft. Join the Army. Go and spend a few nights in

* people’s park. Commit suicide …

* Hm, you keep reading, eh? Oh, well, then you deserve what you get.

* Welcome to the gloomy labyrinth of the tuple level rule system, my

* poor hacker…

6. Unfortunately, PostgreSQL still isn’t particularly fast for transaction processing: Its embrace of write-ahead logging is somewhat half-hearted. Oddly, the PostgreSQL team kept much of the storage overhead of Postgres tuples to provide multiversion concurrency control, something that was never a goal of the Berkeley Postgres project. The result is a storage system that can emulate Oracle’s snapshot isolation with a fair bit of extra I/O overhead, but one that does not support Stonebraker’s original idea of time travel or simple recovery.
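The versioned-tuple design being described can be sketched in a few lines of Python (a toy model under assumed semantics, not PostgreSQL's actual implementation): writes append timestamped versions instead of overwriting, and a reader sees only versions older than its snapshot—the same retained history that made Postgres-style time travel cheap.

```python
import itertools

class MVCCStore:
    """Toy multi-version store: writes append versions, never overwrite."""

    def __init__(self):
        self._clock = itertools.count(1)   # monotonically increasing "xid"
        self._versions = {}                # key -> [(stamp, value), ...]

    def write(self, key, value):
        stamp = next(self._clock)
        self._versions.setdefault(key, []).append((stamp, value))
        return stamp

    def snapshot(self):
        """Capture a snapshot timestamp, as in snapshot isolation."""
        return next(self._clock)

    def read(self, key, snap):
        """Return the newest version written before the snapshot."""
        visible = [v for (s, v) in self._versions.get(key, []) if s < snap]
        return visible[-1] if visible else None

store = MVCCStore()
store.write("salary", 100)
snap = store.snapshot()            # a reader takes its snapshot here
store.write("salary", 200)         # a later write, invisible to that snapshot
old = store.read("salary", snap)               # -> 100 ("time travel")
new = store.read("salary", store.snapshot())   # -> 200 (current state)
```

PostgreSQL implements the visibility check with per-tuple xmin/xmax transaction IDs rather than a single counter, but the read path is the same idea.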

Mike Olson notes that his original intention was to replace the Postgres B-tree implementation with his own B-tree implementation from the BerkeleyDB project, which was developed at Berkeley during the Postgres era. But Olson never found the time. When Berkeley DB got transactional support years later at Sleepycat Corp., Olson tried to persuade the (then-) PostgreSQL community to adopt it for recovery, in place of no-overwrite. They declined; there was a hacker on the project who desperately wanted to build a multi-version concurrency control system, and as that hacker was willing to do the work, he won the argument.

Although the PostgreSQL storage engine is slow, that is not intrinsic to the system. The Greenplum fork of PostgreSQL integrated an interesting alternative high-performance compressed storage engine. It was designed by Matt McCline—a veteran of Jim Gray’s team at Tandem. It also did not support time travel.

7. Some years after Inversion, Bill Gates tilted against this same windmill with WinFS, an effort to rebuild the most widely used filesystem in the world over a relational database backend. WinFS was delivered in developer releases of Windows but never made it to market. Gates later called this his greatest disappointment at Microsoft.

8. Of the three projects, Postgres and RAID both had enormous impact. Sprite is best remembered for Mendel Rosenblum’s Ph.D. thesis on Log Structured File Systems (LFS), which had nothing of note to do with distributed operating systems. All three projects involved new ideas for disk storage beyond mutating single copies in place. LFS and the Postgres storage manager are rather similar, both rethinking logs as primary storage, and requiring expensive background reorganization. I once gently probed Stonebraker about rivalries or academic scoops between LFS and Postgres, but I never got any good stories out of him. Maybe it was something in the water in Berkeley at the time.

9. According to DB Engines (http://db-engines.com/en/ranking. Last accessed January 22, 2018), PostgreSQL today is the fourth most popular DBMS in the world, after Oracle, MySQL and MS SQL Server, all of which are corporate offerings (MySQL was acquired by Oracle many years ago). See http://db-engines.com/en/ranking_definition (Last accessed January 22, 2018) for a discussion of the rules for this ranking.

10. “PostgreSQL is the DBMS of the Year 2017,” DB Engines blog, January 2, 2018. http://db-engines.com/en/blog_post/76. Last accessed January 18, 2018.

11. Note that this is a measure in real transaction dollars and is much more substantial than the values often thrown around in high tech. Numbers in the billions are often used to describe estimated value of stock holdings but are often inflated by 10× or more against contemporary value in hopes of future value. The transaction dollars of an acquisition measure the actual market value of the company at the time of acquisition. It is fair to say that Postgres has generated more than $2.6 billion of real commercial value.

12. Parallelizing PostgreSQL requires a fair bit of work, but is eminently doable by a small, experienced team. Today, industry-managed open-source forks of PostgreSQL such as Greenplum and CitusDB offer this functionality. It is a shame that PostgreSQL wasn’t parallelized in a true open-source way much earlier. If PostgreSQL had been extended with shared-nothing features in open source in the early 2000s, it is quite possible that the open-source Big Data movement would have evolved quite differently and more effectively.

13. Illustra was actually the third name proposed for the company. Following the painterly theme established by Ingres, Illustra was originally called Miró. For trademark reasons the name was changed to Montage, but that also ran into trademark problems.

14. “Informix acquires Illustra for complex data management,” Federal Computer Week, January 7, 1996. http://fcw.com/Articles/1996/01/07/Informix-acquires-Illustra-for-complex-data-management.aspx. Last accessed January 22, 2018.

15. http://en.wikipedia.org/wiki/Netezza. Last accessed January 22, 2018.

16. “Big Pay Day For Big Data. Teradata Buys Aster Data For $263 Million,” TechCrunch, March 3, 2011. http://techcrunch.com/2011/03/03/teradata-buys-aster-data-263-million/. Last accessed January 22, 2018.

17. As Emerson said, “a foolish consistency is the hobgoblin of little minds.”

18. Transcript by Jolly Chen, http://www.postgresql.org/message-id/A4BA155B-E762-4022-B7D1-6F4791014851@chenfamily.com. Last accessed January 22, 2018.

17

Databases Meet the Stream Processing Era

Magdalena Balazinska, Stan Zdonik

Origins of the Aurora and Borealis Projects

In the early 2000s, sensors and sensor networks became an important focus in the systems, networking, and database communities. The decreasing cost of hardware was creating a technology push, while the excitement about pervasive computing, exemplified by projects such as MIT’s “Project Oxygen,”1 was in part responsible for an application pull. In most areas, dramatic improvements in software were needed to support emerging applications built on top of sensor networks, and this need was stimulating research in all three of these fields.

In the database community, as many observed, traditional database management systems (DBMSs) were ill-suited for supporting this new type of stream-processing application. Traditional DBMSs were designed for business data, which is stored on disk and modified by transaction-processing applications. In the new stream processing world, however, data sources such as sensors or network monitors were instead pushing data to the database. Applications2 that needed to process data streams wanted to receive alerts when interesting events occurred. This switch from active users querying a passive database to passive users receiving alerts from an active database [Abadi et al. 2003a] was a fundamental paradigm shift in the database community. The other paradigm shift was that data no longer resided on disk but was being continuously pushed by applications at high, and often variable, rates.

At the time when the above technology and application changes were occurring, Mike Stonebraker was moving from the West Coast to the East Coast. He left Berkeley in 2000 and joined MIT in 2001. At that time, MIT had no database faculty and Mike found himself collaborating with systems and networking groups. Some of the closest database groups were at Brown University and Brandeis University, and Mike would go on to create a successful and long-term collaboration across the three institutions. The collaboration would span multiple research trends and projects (starting with stream processing and the Aurora and Borealis projects) and, following Mike’s model (see Chapter 7), would generate several purpose-built DBMS-engine startup companies (starting with StreamBase Systems).

Mike and his team, together with others in the database community, identified the following key limitations of traditional DBMSs with regard to stream processing applications.

•  Insufficient data ingest rates: When data streams from multiple sources continuously, traditional DBMSs struggle to write the data to disk before making it available for processing.

•  Disk orientation: In traditional DBMSs, data is stored on disk and only cached in memory as a result of query activity. Stream processing applications need to process data fast. They need the data to stay in memory as it arrives and be directly processed by applications.

•  Scalability and performance limitations of triggers: It is possible to create alerts in a classical relational DBMS by using what is called a trigger. Triggers monitor tables and take actions in response to events that make changes to these tables. Triggers, however, were added to DBMSs as an afterthought and were never designed to scale to the needs of streaming applications.

•  Difficulty of accessing past data: Unlike relational queries, which focus on the current state of the database, stream processing queries are time-series-focused. They need to easily access past data, and at least a window of recent data.

•  Missing language constructs: To support streaming applications, SQL must be extended with language constructs such as different types of window-based operations (windowed-aggregation and windowed-joins).

•  Precise query answers: In traditional business applications, queries process data stored on disk and return precise answers. In a stream processing world, input data can get delayed, dropped, or reordered. Stream processing engines must capture these inaccuracies with clear language constructs and semantics. They need a way to ensure computation progress in the face of delayed data, clear semantics in the case of reordered data, and a way to handle data arriving after the corresponding windows of computation have closed.

•  Near real-time requirements: Finally, stream processing applications require near real-time query results. In particular, queries that generate alerts based on data in streams cannot fall arbitrarily behind even when load conditions vary or data sources generate data at different rates.
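
To make the windowing requirement concrete, the sketch below is a minimal, illustrative Python example (not Aurora code; the function name and event values are invented) of the kind of non-blocking windowed aggregation that streaming engines provide and that classic SQL lacked: tuples are grouped into fixed-size event-time windows, and a window's counts are emitted as soon as a tuple from a later window arrives, so processing never blocks on the unbounded input.

```python
from collections import defaultdict

def tumbling_window_counts(stream, window_size):
    """Count tuples per key in fixed-size event-time windows.

    `stream` yields (timestamp, key) pairs, assumed here to arrive in
    timestamp order. A window's result is emitted once a tuple from a
    later window arrives, so the operator never blocks on the
    unbounded input.
    """
    current_window = None
    counts = defaultdict(int)
    for timestamp, key in stream:
        window = timestamp - (timestamp % window_size)
        if current_window is not None and window != current_window:
            # The previous window is complete: emit its counts and reset.
            yield current_window, dict(counts)
            counts = defaultdict(int)
        current_window = window
        counts[key] += 1
    if counts:  # flush the final, possibly incomplete window
        yield current_window, dict(counts)

# Example: connection events as (second, source_ip) pairs.
events = [(0, "10.0.0.1"), (2, "10.0.0.2"), (3, "10.0.0.1"),
          (61, "10.0.0.1"), (62, "10.0.0.1")]
results = list(tumbling_window_counts(events, window_size=60))
# → [(0, {'10.0.0.1': 2, '10.0.0.2': 1}), (60, {'10.0.0.1': 2})]
```

A traditional DBMS would need the whole (infinite) input before a plain GROUP BY could answer; the window construct is what makes the aggregation finite.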

These requirements3—which were initially motivated by military application scenarios where soldiers, equipment, and missiles are continuously tracked—formed the foundation of the new class of stream-processing engines that would emerge in the database community over the coming years.

The Aurora and Borealis Stream-Processing Systems4

The MIT-Brown-Brandeis researchers were on the “bleeding edge” with their Aurora and Borealis projects.

Soon after Mike moved from California to New Hampshire, he attended a National Science Foundation (NSF) meeting. While there, he gathered other participants who also were database researchers in New England computer science departments. Mike had an idea to set databases on their ear by moving from the traditional passive data/active query model to an active data/passive query model that defined stream processing. As was Mike’s style, he had already had discussions with his many industrial connections and determined their common “pain points.” The stream processing model he envisioned could soothe many of them, but he needed collaborators and an army of students to build a workable prototype of the system he had in mind.

Under Mike’s leadership, the three-institution research group embarked on an ambitious project to build not one but two stream-processing systems. The first system, called Aurora [Balakrishnan et al. 2004, Abadi et al. 2003b, Abadi et al. 2003a, Zdonik et al. 2003, Carney et al. 2002], was a single-node system that focused on the fundamental aspects of data model, query language, and query execution for stream processing. The second system, called Borealis [Abadi et al. 2005, Cherniack et al. 2003, Zdonik et al. 2003], was a distributed system that focused on aspects of efficient stream processing across local and wide-area networks including distribution, load balance, and fault-tolerance challenges. Borealis was built on top of the single-node Aurora system.

Both systems were widely successful and released as open-source projects. They led to many publications on various aspects of streaming, from the fundamental data model and architecture to issues of load shedding, revision processing, high availability and fault tolerance, load distribution, and operator scheduling [Tatbul and Zdonik 2006, Ryvkina et al. 2006, Xing et al. 2005, Abadi et al. 2005, Cherniack et al. 2003, Tatbul et al. 2007, Hwang et al. 2005, Balakrishnan et al. 2004, Carney et al. 2003, Abadi et al. 2003b, Tatbul et al. 2003, Abadi et al. 2003a, Zdonik et al. 2003, Carney et al. 2002, Balazinska et al. 2004a, Balazinska et al. 2005]. A large, five-year, multi-institution NSF grant awarded in 2003 allowed the team to scale and aggressively pursue such a broad and ambitious research agenda.

Aurora was an innovative and technically deep system. The fundamental data model remained the relational model with some extensions, namely that relations were unbounded in terms of size (i.e., continuously growing), pushed by remote data sources (and thus append-only), and included a timestamp attribute that the system added to each input tuple in a stream. Queries took the form of boxes-and-arrows diagrams, where operators were connected by streams into directed acyclic query execution graphs. The operators themselves were also relational but extended with windowing constructs to ensure non-blocking processing in the face of unbounded inputs.

Aurora’s programming model and language was based on boxes-and-arrows. The boxes were built-in operators (e.g., SELECT, JOIN, MAP), and the arrows were data flows that would trigger downstream operators. A program was constructed by literally wiring up a boxes-and-arrows diagram using a GUI. For simple problems, this was a compelling way to visualize the data-flow logic. For more difficult problems, the boxes-and-arrows approach became hard to manage. Customers complained that there was no standard. Mike convened a group consisting of Jennifer Widom from Stanford, the designer of STREAM [Motwani et al. 2003]; a few engineers from Oracle who had decided to use the stream language in Oracle’s middleware product; and some designers from StreamBase (the product based on Aurora) to come up with a single textual language that the group could salute. Mike said that this should be easy since both the boxes-and-arrows language and STREAM supported features like windows, and it should just be an exercise in merging the two. This would solve both problems. Upon further investigation, the committee decided that the underlying processing models were significantly different, and a merge would be too complex to be practical. The details are in the Proceedings of the VLDB Endowment (PVLDB) article [Jain et al. 2008].
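
In code, the boxes-and-arrows model can be approximated by composing generators: each box transforms a stream, and each arrow is simply one box's output feeding the next box's input. The following is a hypothetical Python sketch (Aurora programs were wired graphically, and the operator names and trade values here are invented), but it conveys why the model was compelling for simple dataflows.

```python
# Each "box" is a generator transformation over a stream of tuples;
# each "arrow" is the composition of one box's output into the next
# box's input, forming a directed acyclic dataflow graph.

def select_box(stream, predicate):   # hypothetical stand-in for a SELECT box
    for tup in stream:
        if predicate(tup):
            yield tup

def map_box(stream, fn):             # hypothetical stand-in for a MAP box
    for tup in stream:
        yield fn(tup)

# Wire a tiny two-box diagram: keep trades over $1,000, then tag them.
trades = [("IBM", 500), ("AAPL", 1500), ("MSFT", 2500)]
result = list(map_box(select_box(trades, lambda t: t[1] > 1000),
                      lambda t: (t[0], t[1], "large")))
print(result)  # → [('AAPL', 1500, 'large'), ('MSFT', 2500, 'large')]
```

For a two-box diagram this composition is transparent; as the paragraph above notes, once diagrams grew to dozens of boxes the approach became hard to manage.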

Interestingly, from its inception, Aurora’s design included constructs to connect with persistent storage by using what we called connection points. A connection point could buffer a stream on disk. As such, it could serve as a location where new queries could be added and could re-process recent data that had accumulated in the connection point. A connection point could also represent a static relation on disk and serve to help join that relation with streaming data. The connection point design, while interesting, was somewhat ahead of its time, and for many years most focus was purely on processing data streams without connection points. In modern stream-processing systems, as we discuss below, the ability to persist and reprocess streams is an important function.5
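
The connection-point idea can be sketched in a few lines of Python. This hypothetical class (the name and API are invented for illustration) buffers the most recent tuples in memory, whereas Aurora buffered on disk, and lets a query that attaches later first replay that history before receiving live data.

```python
from collections import deque

class ConnectionPoint:
    """Minimal sketch of a connection point: pass tuples downstream
    while buffering the most recent `capacity` of them, so a query
    attached later can first replay recent history.
    (Illustrative only; Aurora buffered to disk, not memory.)"""
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)
        self.subscribers = []

    def attach(self, callback):
        # New query: replay buffered history, then receive live tuples.
        for tup in self.buffer:
            callback(tup)
        self.subscribers.append(callback)

    def push(self, tup):
        self.buffer.append(tup)
        for callback in self.subscribers:
            callback(tup)

cp = ConnectionPoint(capacity=3)
for t in (1, 2, 3, 4):
    cp.push(t)          # tuples flow through; buffer keeps [2, 3, 4]

seen = []
cp.attach(seen.append)  # a late query replays history...
cp.push(5)              # ...and then sees live data
print(seen)             # → [2, 3, 4, 5]
```

The same buffer could also back a join between streaming tuples and a stored relation, which is the second use the paragraph above describes.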

Aurora included additional innovative features such as constructs and semantics for out-of-order or late data and corrections on streams, and novel methods for query scheduling, quality of service, load shedding, and fault tolerance [Tatbul and Zdonik 2006, Tatbul et al. 2007, Tatbul et al. 2003, Balakrishnan et al. 2004, Abadi et al. 2003b, Abadi et al. 2003a, Carney et al. 2003, Carney et al. 2002].

Following Aurora, the Borealis system tackled the distributed nature and requirements of streaming applications. In Borealis, operators in a query plan could be distributed across multiple machines in a cluster or even over wide-area networks, as illustrated in Figure 17.1. Distribution is important for applications where sources send data at high rates from remote locations, such as network monitoring and sensor-based applications. Borealis provided fundamental abstractions for distributed stream processing and included load management [Xing et al. 2005, Balazinska et al. 2004a] and different types of high-availability and fault-tolerance features [Hwang et al. 2005, Balazinska et al. 2005].

Figure 17.1  Example of a distributed stream-processing application in Borealis. This application labels as a potential network intruder any source IP that establishes more than 100 connections and connects over 10 different ports within a 1-min time window. Source: [Balazinska et al. 2004a]

Incredibly, Mike was involved in the design of all system components in both systems. He seemed to possess infinite amounts of time to read through long design documents and provide detailed comments. He would attend meetings and listen to everyone’s ideas and discussions. Importantly, he had infinite patience to coordinate all Ph.D. students working on the system and ensure regular publications at SIGMOD and VLDB and highly visible system demonstrations at those conferences [Ahmad et al. 2005, Abadi et al. 2003a].

By working with Mike, the team learned several important lessons. First, it is important to examine all problems from either 10,000 or 100,000 feet. Second, all problems should be captured in a quad chart, and the top right corner always wins; for example, see the Grassy Brook quad chart in Figure 17.2. Third, new benchmarks shall be crafted in a way that makes the old technology look terrible. Fourth, database people deeply worry about employees in New York and employees in Paris and the differences in how their salaries are computed. (While this example never made it into any paper, it was the first example that Mike would draw on a board when explaining databases to systems people.) Fifth, stream processing was the killer technology that could stop people from stealing overhead projectors.

Finally—and most importantly—the team learned that one could take graduate students and faculty from different institutions, with different backgrounds, without any history of collaboration, put them in one room, and get them to build a great system together!

Figure 17.2  High-performance stream processing. Source: StreamBase (Grassy Brook) Pitch Deck (2003)

Concurrent Stream-Processing Efforts

At the same time as Mike’s team was building the Aurora and Borealis systems, other groups were also building stream-processing prototypes. The most prominent projects included the STREAM processing system [Motwani et al. 2003] from Stanford, the TelegraphCQ project [Chandrasekaran et al. 2003] from Berkeley, NiagaraCQ [Chen et al. 2000] from Wisconsin, and the Gigascope project [Cranor et al. 2003] from AT&T. Other projects were also under development, and many papers on various aspects of streaming technologies started appearing at SIGMOD and VLDB conferences. There was intense friendly competition, which moved the research forward quickly.

Overall, the community was focused on fundamental issues behind effectively processing data streams in database management systems. These issues included the development of new data models and query languages with constructs for basic data stream processing but also for processing data revisions and out-of-order data and to perform time-travel operations on streams. The community also developed new query optimization techniques including methods to dynamically change query plans without stopping execution. Finally, several papers contributed techniques for operator scheduling, quality of service and load shedding, and fault-tolerant distributed and parallel stream processing.
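
As one concrete example of these techniques, load shedding trades answer accuracy for bounded latency by dropping tuples when input outpaces the engine. The sketch below shows the simplest random-drop variant in Python; it is illustrative only, with an invented function name, and is not the QoS-driven scheme from the Aurora papers.

```python
import random

def random_load_shedder(stream, capacity, arrival_rate, seed=0):
    """Drop each tuple with probability 1 - capacity/arrival_rate so
    that expected throughput matches what the engine can process.
    Illustrative only: real schemes (as in Aurora) are QoS-driven and
    adapt the drop rate to observed load."""
    rng = random.Random(seed)
    keep_prob = min(1.0, capacity / arrival_rate)
    for tup in stream:
        if rng.random() < keep_prob:
            yield tup

# No overload: nothing is dropped.
assert list(random_load_shedder(range(5), capacity=10, arrival_rate=5)) == [0, 1, 2, 3, 4]
# 2x overload: roughly half of the tuples are shed.
kept = list(random_load_shedder(range(1000), capacity=1, arrival_rate=2))
print(len(kept))  # roughly 500 of the 1,000 tuples survive
```

The research cited above goes much further, choosing *which* tuples to drop so that the quality-of-service loss, not just the tuple count, is minimized.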

Aurora and Borealis were among the leaders in almost all aspects of stream processing at that time.

Founding StreamBase Systems6

The Aurora academic prototype was demonstrated at SIGMOD 2003 [Abadi et al. 2003a], and we had Aurora/Borealis ball caps made for the team members in attendance. (They are now collectors’ items.) Not long after that, Mike decided that it was time to commercialize the system by founding a company with Stan Zdonik, Hari Balakrishnan, Ugur Cetintemel, Mitch Cherniack, Richard Tibbetts, Jon Salz, Don Carney, Eddie Galvez, and John Partridge. The initial name for the company was Grassy Brook, Inc.,7 after Grassy Pond Road, the location of Mike’s house on Lake Winnipesaukee in New Hampshire. We prepared a slide deck with the business plan that was to be presented to Boston-area venture capitalists. The slide deck included a famous Stonebraker Quad chart (see Figure 17.2) that argued that the StreamBase (then Grassy Brook) sweet spot was high-performance stream processing—something that no conventional data management platforms could accommodate and at the time coming into high demand by customers. We got halfway through the presentation when it was clear that the VCs wanted to invest. It helped that Mike was a “serial entrepreneur,” and thus a relatively safe bet.

The company’s first office was in Wellesley, Massachusetts, in the space of one of our investors, Bessemer Venture Partners. Later, Highland Capital, another investor, gave Grassy Brook some space in their Lexington, Massachusetts, facility (see Chapter 9). Soon the name was changed to StreamBase, which required buying the name from a previous owner. Mike and John Partridge developed a pitch deck for the VCs, and then Mike did most of the work and all of the presenting. Once they closed on the small initial financing, Mike and John met with potential early customers, mainly financial services people. When they had enough market feedback to substantiate their claims about customer demand, they went back to Bessemer and Highland to get the larger investment.

With fresh funding in hand, we hired a management team including Barry Morris as CEO, Bobbi Heath to run Engineering, and Bill Hobbib to run Marketing. Mike served as CTO. The venture capitalists helped enormously in connecting us with potential customers. We were able to get an audience with many CTOs from Wall Street investment banks and hedge funds, a first target market for the StreamBase engine: Their need for extremely low latency seemed like a perfect match. This need led to many train trips to New York City, opening an opportunity to tune StreamBase for a demanding and concrete application. The company engaged in many proof-of-concept (POC) applications that, while an expensive way to make sales, helped sharpen the StreamBase engineers.8

The company tried to open a new market with government systems. In particular, it hired a Washington-based sales team that tried to penetrate the intelligence agencies. StreamBase made a big effort to sell to “three letter agencies” and In-Q-Tel (the CIA’s venture capital arm) later invested in StreamBase. One partnership was with a Washington, D.C., Value Added Reseller (VAR) that built applications on top of StreamBase for government customers. StreamBase had a sales representative and a sales engineer in D.C. to sell directly to the government, but that never became a meaningful part of the business.

After some years, StreamBase was acquired by TIBCO Software, Inc. TIBCO is still in operation today in Waltham, Massachusetts, and sells TIBCO StreamBase®.

Stream Processing Today

In recent years, industry has been transformed by the wave of “Big Data” and “Data Science,” where business decisions and product enhancements are increasingly based on results of the analysis of massive amounts of data. This data includes search logs, clickstreams, and other “data exhaust” from planetary-scale Web 2.0 applications. In many of these applications, the data of interest is generated in a continuous fashion and users increasingly seek to analyze the data live as it is generated. As a result, stream processing has emerged as a critical aspect of data processing in industry and many systems are being actively developed. Modern stream-processing systems include Apache Kafka, Heron, Trill, Microsoft StreamInsight, Spark Streaming, Apache Beam, and Apache Flink.

Interestingly, the stream-processing systems being developed in industry today are fundamentally the same as the ones we and others in the database community built all those years ago. The goal is to process unbounded streams of tuples. Tuples are structured records and include timestamps. Processing involves grouping data into windows for aggregation and ensuring high availability and fault tolerance. Recent systems, however, do have a somewhat different emphasis than our original work from the database community. In particular, they seek a single programming model for batch data processing and stream processing. They focus significantly more on parallel, shared-nothing stream processing, and they seek to provide powerful APIs in Python and Java as well as seamless support for user-defined functions.
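
A hallmark of these modern systems is event-time windowing with watermarks to cope with out-of-order data, echoing the late-data semantics the early work identified. The following simplified Python sketch (not any one system's API; the function name and events are invented) finalizes a window once a watermark, trailing the largest timestamp seen by an allowed lateness, passes the window's end, and reports tuples that arrive after that as late.

```python
def watermark_windows(stream, window_size, allowed_lateness):
    """Event-time tumbling-window counts over an out-of-order stream.
    A watermark trails the maximum timestamp seen by `allowed_lateness`;
    a window is finalized once the watermark passes its end, and tuples
    arriving after that are reported as late. Simplified sketch only."""
    windows = {}        # window start -> tuple count
    finalized = set()   # starts of closed windows (unbounded; a sketch)
    late = []
    max_ts = None
    for ts, key in stream:
        start = ts - ts % window_size
        if start in finalized:
            late.append((ts, key))   # arrived after its window closed
            continue
        windows[start] = windows.get(start, 0) + 1
        max_ts = ts if max_ts is None else max(max_ts, ts)
        watermark = max_ts - allowed_lateness
        for s in sorted(windows):
            if s + window_size <= watermark:
                yield ("window", s, windows.pop(s))
                finalized.add(s)
    for s in sorted(windows):        # end of stream: flush open windows
        yield ("window", s, windows.pop(s))
    for ts, key in late:
        yield ("late", ts, key)

# Out-of-order events: (3, "a") arrives in time; (2, "a") arrives too late.
events = [(1, "a"), (12, "a"), (3, "a"), (17, "a"), (2, "a")]
out = list(watermark_windows(events, window_size=10, allowed_lateness=5))
# → [('window', 0, 2), ('window', 10, 2), ('late', 2, 'a')]
```

Conceptually this is the same bargain Aurora struck: trade a bounded amount of waiting (the allowed lateness) for correct answers over disordered input.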

All first-generation streaming systems ignored the need for transactions. The MIT-Brown-Brandeis research continues in a project called S-Store [Çetintemel et al. 2014],9 which attempts to integrate transactions into a streaming engine [Meehan et al. 2015a, Meehan et al. 2015b].

It is a testament to Mike’s vision and leadership that our original Aurora/Borealis work stood the test of time and that today’s streaming engines so naturally build on those past ideas. One of our papers, on fault tolerance in Borealis at SIGMOD’05 [Balazinska et al. 2005], won a “test of time” award at SIGMOD’17.

Acknowledgments

Thank you to John Partridge, StreamBase Systems Co-Founder and VP Business Development, for sharing his memories and stories for this chapter.

1. http://oxygen.csail.mit.edu. Last accessed May 16, 2018.

2. “It’s largely forgotten now, but RFID (radio frequency identification) tagging was then a huge area of investment. Walmart and the USAF had purchased a lot of RFID hardware and software to manage their supply chains in real time, initially at the pallet level and eventually, it was planned, at the item level. Those streams of RFID telemetry data cried out for a good application development platform like StreamBase’s.”—John Partridge, StreamBase Co-Founder and VP Business Development

3. “One insight not listed here is that Aurora offered a workflow-based diagrammatic “language” that was much better suited to real-time application development than a traditional declarative language like SQL. Basically, it exposed the query planner that usually is hidden (or mostly hidden) from application developers and turned it into an expressive and powerful language in its own right. Developers liked it because it made them more productive—it was the right tool for the job. It gave them direct control over the processing of each tuple, rather than leaving them to worry what a capricious query optimizer might decide to do.”—John Partridge

4. See Chapter 26.

5. “This insight is an example of Mike’s foresight and it was a good thing he thought of it. Conversations with our early potential customers (trading desks at investment banks and hedge funds) quickly revealed that they wanted a clean way of integrating their real-time processing with querying historical data (sometimes years old, sometimes hours or minutes old). Connection points functionality matured quickly as a result.”—John Partridge

6. See Chapter 26.

7. The initial company name would have been “Grassy Pond” except that the domain name was taken at the time. So I bought grassybrook.com.—John Partridge

8. Following Mike’s by now well-established pattern for starting a company, described in Chapter 7.

9. http://sstore.cs.brown.edu/ (Last accessed March 28, 2018.)

18

C-Store: Through the Eyes of a Ph.D. Student

Daniel J. Abadi

I first met Mike Stonebraker when I was an undergraduate. Before I met Mike, I had no intention of going into a career in computer science, and certainly not into a career of computer science research.

I always knew that I wanted to go into a career that made an impact on people’s lives and I had come to the conclusion that becoming a doctor would be the optimal way to achieve this career goal. At the time I met Mike, I was three quarters of the way through the pre-med undergraduate course requirements and was working in John Lisman’s lab researching the biological causes of memory loss. However, to earn some extra income, I was also working in Mitch Cherniack’s lab on query optimization. Mitch included me in the early research meetings for the Aurora project (see Chapter 17), through which I met Mike.

How I Became a Computer Scientist

At the time I met Mike, my only experience with computer science research involved writing tools that used automated theorem provers to validate the correctness of query rewrite rules during query optimization in database systems. I found the project to be intellectually stimulating and technically deep, but I was conscious of the fact that in order for my research to have an impact, the following chain of events would have to occur.

1.  I would have to write a research paper that described the automated correctness proofs that we were working on.

2.  This paper would have to be accepted for publication in a visible publication venue for database system research.

3.  Somebody who was building a real system would have to read my paper and decide that the techniques introduced by the paper were a better way to ensure correctness of rewrite rules than alternative options, and therefore integrate our techniques into their system.

4.  The real system would then have to be deployed by real people for a real application.

5.  That real application would have to submit a query to the system for which a generic database system would have produced the wrong answer, but since they were using a system that integrated our techniques, the correct answer was produced.

6.  The difference between the wrong answer and the correct answer had to be large enough that it would have led to an incorrect decision being made in the real world.

7.  This incorrect decision needed to have real-world consequences.

If any link in this chain failed to come to fruition, the impact of the research would be severely limited. Therefore, my belief at the time was that computer science research was mostly mathematical and theoretical, and that real-world impact was possible but had long-shot probability.

My early interactions with Mike very quickly disabused me of this notion. Any attempt to include math or theory in a meeting with Mike would either be brushed aside with disdain, or (more commonly) after we completed the process of writing down the idea on the whiteboard, we would glance at Mike and notice that he had lost consciousness, fast asleep with his face facing the ceiling. Any idea that we introduced would be met with questions about feasibility, practicality, and how we were going to test the idea on real-world datasets and workloads. I learned the following rules of thumb from Mike regarding achieving real-world impact.

1.  Complexity must be avoided at all costs. The most impactful ideas are simple ideas for the following reasons.

(a)  Complex ideas require more effort for somebody to read and understand. If you want people to read the papers that you write, you should minimize the effort you ask them to go through in reading your paper.

(b)  Complex ideas are hard to communicate. Impact spreads not only through your own communication of your ideas via papers and presentations, but also through other people summarizing and referring to your ideas in their papers and presentations. The simpler the idea, the more likely someone else will be able to describe it to a third party.

(c)  Complex ideas are hard to implement. To get an idea published, you generally have to implement it in order to run experiments on it. The harder it is to implement, the longer it takes to build, which reduces the overall productivity of a research group.

(d)  Complex ideas are hard to reproduce. One way of achieving impact is for other people to take your ideas and implement them in their system. But if that process is complicated, they are less likely to do so.

(e)  Complex ideas are hard to commercialize. At the end of the day, commercialization of the idea requires communication (often to nontechnical people) and rapid implementation. Therefore, the communication and implementation barriers of the complex ideas mentioned above also serve as barriers to commercialization.

2.  It is better to build a complete system than it is to focus your research on just a single part of a system. There are three main reasons for this.

(a)  Narrow ideas on isolated parts of the system risk being irrelevant because a different part of the system may be the bottleneck in real-world deployments. Mike would always be asking about “high poles in the tent”—making sure that our research efforts were on real bottlenecks of the system.

(b)  System components interact with each other in interesting ways. If research focuses on just a single part of the system, the research will not observe these important interactions across components.

(c)  It is generally easier to commercialize an entire system than just a single component. The commercialization of individual components requires deep partnership with existing software vendors, which a young, fledgling startup usually struggles to achieve. A complete system can be built from scratch without relying on third parties during the implementation effort.

3.  Impact can be accelerated via commercialization. There have been many great ideas that have been published in computer science venues that have made important impact via somebody else reading the paper and implementing the idea. However, the vast majority of ideas that are published in academic venues never see the light of day in the real world. The ones that do make it to the real world are almost never immediate—in some cases there is a delay of a decade or two before the technology is transferred and applied. The best way to increase both the probability and speed of transferring an idea to the real world is to go to the effort of raising money to form a company around the idea, build a prototype that includes at least the minimal set of features (including the novel features that form the central thesis of the research project) that enable the prototype to be used in production for real applications, and release it to potential customers and partners.

A second advantage to direct commercialization of research technology is exposure to real-world ramifications of a particular technology. This experience can often be fed back into a research lab for successful future research projects.

In short, Mike taught me that computer science research could be far more direct than I had realized. Furthermore, the impact is much more scalable than the localized professions I had been considering. In the end, I decided to apply to graduate school and pursue a career in research.

The Idea, Evolution, and Impact of C-Store

At the time I applied to MIT (for admission in the fall of 2003), there were no database system faculty at MIT aside from Mike (who had recently joined MIT as an adjunct professor). Nonetheless, Mike helped to ensure that my application to MIT would be accepted, even though he had no intention of taking on the role of advising Ph.D. students in his early years at MIT. Mike matched me up with Hari Balakrishnan as my temporary official advisor, while he continued to advise me unofficially. Shortly thereafter, Sam Madden joined MIT, and Sam, Mike, and I, along with teams from Brown, UMass, and Brandeis, began to work on the C-Store project [Stonebraker et al. 2005a] in 2004.

The early days of the research on the C-Store project have formed my approach to performing research ever since that point. C-Store was never about innovating just for the sake of innovating. C-Store started with Mike taking his experience and connections in industry and saying, “There’s a major pain point here. None of the ‘Big Three’ database systems—Oracle, IBM’s DB2, and Microsoft’s SQL Server—scale queries to the degree that the upcoming ‘Big Data’ era will require, and other existing solutions are wildly inefficient. Let’s build a system that will scale and process queries efficiently.”

We can see from this example two key things that Mike did that are instructive about making impact.

1.  He found a source of existing pain. If you want to make impact, you have to do research in areas where there is so much existing pain that people will pay the switching costs to adopt your solution if it becomes available to them.

2.  He identified a trend that would magnify the pain before it took off. “Big Data” was only just emerging as an industry term in 2004 when the project started.

The C-Store project involved creating a scalable database system that is optimized for read-mostly workloads, such as those found in data warehousing environments (i.e., workloads that are almost entirely read queries, with occasional batch appends of new records and rare updates of previously inserted records). The storage layer included two components: a read-only component (where most of the data was stored) and a writable component. Updates to existing records were handled by deleting from the read-only component and inserting a new record into the writable component. By virtue of being read-only, the read-only component was able to make several optimizations, including dense-packing the data and indexes, keeping data in strictly sorted order (including redundantly storing different materialized views or "projections" in different sort orders), compressing the data, reading data in large blocks from disk, and using vastly simplified concurrency control and recovery protocols.

In contrast to the read-only component, the writable component was generally stored in-memory and optimized for inserting new records. These inserts can happen slowly over a period of time (they were known as “trickle updates”) or they can happen in batch (e.g., when a previous day’s log of data is written to the data warehouse overnight). All queries would include data from both the read-only and writable component (data from the two components would be dynamically merged on the fly at query time). However, it was important that the writable component fit entirely in main memory. Therefore, a “tuple mover” would move data from the writable component to the read-only component in batches as a background process.

C-Store was most famous for storing data in "columns." In general, a database table is a two-dimensional object and needs to be serialized to a one-dimensional storage interface when the data is written to storage. Most (but not all) database systems at the time performed this serialization row by row: first they would store the first row, then the second, and so on. In contrast, column-oriented systems such as C-Store stored data column by column. Storing columns separately helped to optimize the system for read-only queries that scan through many rows, since the system only needs to expend I/O time reading from storage the specific columns necessary to answer the query. For queries that accessed a small percentage of the columns in a table, the performance benefits would be large. However, inserting a new tuple can be slow since the different attributes in the tuple have to be written to separate locations. This is why it was critical to have a separate writable component that was in-memory. Indeed, in some of the early designs of C-Store, only the read-only component stored data in columns, while the writable component used a traditional row-oriented design.
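
The two serialization orders can be illustrated with a toy example in Python. The table and column names are invented; this is not C-Store's actual storage format, just the layout difference.

```python
# Toy illustration of row- vs. column-oriented serialization.
table = [
    {"id": 1, "name": "alice", "balance": 100},
    {"id": 2, "name": "bob",   "balance": 250},
    {"id": 3, "name": "carol", "balance": 75},
]

# Row-oriented: rows are stored contiguously, one after another.
row_store = [(r["id"], r["name"], r["balance"]) for r in table]

# Column-oriented: each column is stored contiguously on its own.
col_store = {
    "id":      [r["id"] for r in table],
    "name":    [r["name"] for r in table],
    "balance": [r["balance"] for r in table],
}

# A query touching only "balance" reads one contiguous column in the
# column store, but must skip over every attribute of every row in
# the row store.
total = sum(col_store["balance"])
print(total)  # 425
```

In a real column store the columns are compressed, dense-packed files on disk, but the I/O asymmetry is the same: the `balance` scan never pays to read `id` or `name`, while an insert must touch all three column files.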

Building C-Store with Mike

My job within the C-Store project was to collaborate on the design and write code to implement both the core read-optimized storage layer and the query execution engine. I therefore had the opportunity to spend many hours with Mike and Sam at the whiteboard, discussing the tradeoffs of some of the different design decisions of these parts of the system. When I look back at my time in graduate school, I think of this time very fondly: the excitement surrounding preparing for one of these meetings, the back and forth during these meetings, and then the process of replaying the meeting in my head afterward, reviewing the highlights and lowlights and planning for what I would do differently next time to try to do a better job convincing Mike of an idea that I had tried to present during the meeting.

In general, Mike has a tremendous instinct for making decisions on the design of a system. However, as a result, any idea that runs counter to his instinct has almost no chance of seeing the light of day without somebody actually going to the effort of building the idea and generating incontrovertible proof (which, given the fact that the idea runs counter to Mike's instinct, is unlikely to be possible).

One example of this is a discussion around the compression methods that should be used in the storage layer. Column-stores present a tremendous opportunity for revisiting database compression under a new lens: Not only do column-stores, by virtue of storing data from the same attribute domain contiguously, observe much smaller data entropy (and therefore are more amenable to compression in general), but they also make it possible to compress each column of a table using a different compression scheme. Mike had created a quad chart (see Figure 18.1) for what compression scheme should be used for each column. The two dimensions of the quad chart were: (1) is the column sorted and (2) how high is the cardinality of the column (the number of unique values).

Figure 18.1  A Stonebraker quad chart for choosing compression algorithms in C-Store.
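
The chart's decision logic can be written down as a small function. The mapping below is my reconstruction from the published C-Store design (sorted plus low cardinality favors run-length encoding, and so on), not a verbatim copy of Mike's chart, and the cardinality cutoff is purely illustrative.

```python
def choose_compression(is_sorted, cardinality, num_rows):
    """Pick a per-column compression scheme from the two quad-chart
    dimensions: is the column sorted, and how many distinct values
    does it have relative to its length."""
    low_cardinality = cardinality < num_rows ** 0.5  # illustrative cutoff
    if is_sorted and low_cardinality:
        return "run-length encoding"   # long runs of identical values
    if is_sorted:
        return "delta encoding"        # small gaps between sorted neighbors
    if low_cardinality:
        return "bit-vector encoding"   # one bitmap per distinct value
    return "uncompressed"              # unsorted, high-entropy data

print(choose_compression(True, 10, 1_000_000))   # run-length encoding
```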

I had made a proposal that instead of using Bit-vector encoding for unsorted, low-cardinality columns, we should instead use arithmetic encoding. This led to the following email conversation that I found in my email history from graduate school:

Mike wrote to me:

I don’t see the advantage of arithmetic coding. At best it would break even (space-wise) compared to what we have and adds decoding cost. Moreover, it gets in the way of passing columns between operators as bit maps. Also, coding scheme would have to “learn” alphabet, which would have to be “per column.” Adding new values would be a problem.

I wrote back:

The anticipated advantage would be the space savings (and thus i/o cost). I’m not sure why you’re saying we will break even at best. We’re seeing about 4× compression for the [unsorted, low cardinality columns] we have now, which is about what dictionary would get. Depending on the data probabilities, arithmetic would get about an additional 2× beyond that.

To which he responded:

I don’t believe you. Run length encoding (sic. RLE) of the bit map should do the same or better than arithmetic. However, this question must get settled by real numbers.

Also, arithmetic has a big problem with new values. You can’t change the code book on the fly, without recoding every value you have seen since time zero .…

This conversation was typical of the “prove it” requirement of convincing Mike of anything. I went ahead and spent time carefully integrating arithmetic encoding into the system. In the end, Mike was right: Arithmetic coding was not a good idea for database systems. We observed good compression ratios, but the decompression speed was too slow for viability in a performance-oriented system.1
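
The back-of-the-envelope numbers in that email exchange are easy to check: dictionary coding spends ceil(log2 k) bits per value regardless of how skewed the distribution is, while arithmetic coding approaches the Shannon entropy of the column. The four-value distribution below is hypothetical, chosen only to show how a roughly 2x gap can arise.

```python
import math

# Hypothetical skewed distribution over 4 distinct column values.
probs = [0.85, 0.05, 0.05, 0.05]

# Dictionary coding: fixed-width codes, ceil(log2 k) bits per value.
dict_bits = math.ceil(math.log2(len(probs)))  # 2 bits

# Arithmetic coding approaches the Shannon entropy of the column.
entropy_bits = -sum(p * math.log2(p) for p in probs)

print(dict_bits, round(entropy_bits, 2))        # 2 0.85
print(round(dict_bits / entropy_bits, 2))       # 2.36
```

With a uniform distribution the entropy equals log2 k exactly and arithmetic coding only breaks even with the dictionary, which is precisely Mike's objection; the space win exists only when the data is skewed, and it comes at the decompression cost that ultimately made the idea a loser.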

A few more lessons from the C-Store project about systems research in general.

1.  None of the ideas of the paper were new. Column-stores had already been in existence (Sybase IQ was the first widely deployed commercial implementation of a column-store). Shared-nothing database systems had already been in existence. C-Store’s main contribution was the combination of several different ideas in a way that made the whole more than the sum of its parts.

2.  Although the idea of column-stores had been around for over two decades before the C-Store paper, they were held back by several limiting design decisions. Sometimes, innovations within an existing idea can turn it from not really widely practical, to being so practical that the idea becomes ubiquitous. Once column-stores caught on, all major DBMSs developed column-store extensions, either by storing data column by column within a page (e.g., Oracle) or in some cases with real column-store storage managers (such as IBM Blu and Microsoft SQL Server Apollo).

Founding Vertica Systems

As mentioned above, a great way to accelerate research impact is to start a company that commercializes the research.2 Mike’s career is perhaps the quintessential example of this approach, with Mike repeatedly transferring technology from his lab into commercialization efforts. In the case of C-Store, the transition from research project to startup (Vertica) was extremely rapid. Vertica’s first CEO and VP of Engineering were already in place in March 2005, the same month in which we submitted the original C-Store VLDB paper [Stonebraker et al. 2005a] to be reviewed by the VLDB program committee.

I was privileged to have the opportunity to work closely with Mike and Vertica's first CEO (Andy Palmer) in the technology transfer process.3 The typical process for any technology startup of this kind is to begin the commercialization effort in search of use cases. Although we were confident that the C-Store technology was widely applicable across many application spaces, different domains would have different existing solutions for handling their data and query workloads. In some domains, the existing solution was good enough, or close enough to good enough, that they were not in significant pain. In other domains, the current solutions were so insufficient that they would be willing to risk trying out a new and unproven technology from a tiny startup. Obviously, it is those domains that the first versions of a technology startup should focus on. Therefore, we spent many hours on the phone (sometimes together, sometimes independently) with CIOs, CTOs, and other employees in charge of data infrastructure and data warehousing at companies from different domains, trying to gauge how much pain they were currently in and how well our technology could alleviate that pain.

Indeed, my longest continuous time with Mike was a road trip we took in the early days of Vertica. I took a train to Exeter, New Hampshire, and spent a night in Andy Palmer's house. The next morning, we picked up Mike in Manchester, and the three of us drove to central Connecticut to meet with a potential alpha customer, a large online travel company. The two things I remember about this meeting:

1.  Being in awe as Mike was able to predict the various pain points that the data infrastructure team was in before the CIO mentioned them. Interestingly, there were some pain points that were so prevalent in the industry that the CIO did not even realize that he was in pain.

2.  C-Store/Vertica’s main initial focus was on the storage layer; it had only a very basic first version of the optimizer, which worked only with star schemas.4 I remember an amusing exchange between Mike and a member of their team in which Mike tried to convince him that even though they didn’t actually have a star schema, if they squinted and looked at their schema from 10,000 ft, it was indeed a star schema.

In the end, Vertica was a huge success. It was acquired by Hewlett-Packard in 2011 for a large sum and remained successful through the acquisition. Today, Vertica has many thousands of paying customers, and many more using the free community edition of the software.

More importantly, the core design of C-Store’s query execution engine, with direct operation on compressed, column-oriented data, has become prevalent in the industry. Every major database vendor now has a column-oriented option, with proven performance improvements for read-mostly workloads. I feel very fortunate. I will forever be grateful to Mike Stonebraker for believing in me and giving me the opportunity to collaborate with him on C-Store, and for supporting my career as a database system researcher ever since.

1. I did, however, eventually convince Mike that we should move away from bit-vector compression and use dictionary compression for unsorted, low cardinality columns.

2. For the story of the development of the Vertica product, see Chapter 27.

3. For more on the founding of Vertica, read Chapter 8.

4. For more about the significance of star schemas, read Chapter 14.

19

In-Memory, Horizontal, and Transactional: The H-Store OLTP DBMS Project

Andy Pavlo

I remember the first time that I heard the name Mike Stonebraker. After I finished my undergraduate degree, I was hired as a systems programmer at the University of Wisconsin in 2005 to work on HTCondor, a high-throughput job execution system, for Miron Livny. My colleague in the office next to me (Greg Thain) was responsible for porting a version of HTCondor called CondorDB1 from David DeWitt’s research group. The gist of CondorDB was that it used Postgres as its backing data store instead of its custom internal data files. Although my friend was never a student of Mike’s, he regaled me with stories about all of Mike’s research accomplishments (Ingres, Postgres, Mariposa).

I left Wisconsin in 2007 and enrolled at Brown University for graduate school. My original intention was to work with another systems professor at Brown. The first couple of weeks I dabbled in a couple of project ideas with that professor, but I was not able to find good traction with anything. Then the most fortuitous thing in my career happened after about the second week of classes. I was in Stan Zdonik’s database class when suddenly he asked me if my name was Andy Pavlo. I said “yes.” Stan then said that he had had a phone call with Mike Stonebraker and David DeWitt the previous night about ramping up development for the H-Store project and that DeWitt recommended me as somebody that Stan should recruit to join the team.

At first, I was hesitant to do this. After I decided to leave Wisconsin to start graduate school at Brown, I had a parting conversation with DeWitt. The last thing that he said to me was that I should not work with Stan Zdonik. He never told me why. I later learned it was because Stan was traveling a lot to visit companies on behalf of Vertica with Mike. At the time, however, I did not know this, and thus I was not sure about switching to have Stan as my adviser. But I was curious to see what all the fuss was about regarding Stonebraker, so I agreed to at least attend the first kick-off meeting. That was when I met Mike.

As I now describe, there were several incarnations of the H-Store system as an academic project and eventually the VoltDB commercial product (see Chapter 28).

System Architecture Overview

The H-Store project was at the forefront of a new movement in DBMS architectures called NewSQL [Pavlo and Aslett 2016]. During the late 2000s, the hottest trend in DBMSs was the so-called NoSQL systems, which forego the ACID (Atomicity, Consistency, Isolation, Durability) guarantees of traditional DBMSs (e.g., Oracle, DB2, Postgres) in exchange for better scalability and availability. NoSQL supporters argued that SQL and transactions were limitations to achieving the high performance needed in modern operational, online transaction processing (OLTP) applications. What made H-Store different was that it sought to achieve the improved performance of NoSQL systems without giving up the transactional guarantees of traditional DBMSs.

Mike’s key observation was that existing DBMSs at that time were based on the original system architectures from the 1970s that were too heavyweight for these workloads [Harizopoulos et al. 2008]. Such OLTP applications are characterized as comprising many transactions that (1) are short-lived (i.e., no user stalls), (2) touch a small subset of data using index lookups (i.e., no full table scans or large distributed joins), and (3) are repetitive (i.e., executing the same queries with different inputs).

H-Store is a parallel, row-storage relational DBMS that runs on a cluster of shared-nothing, main-memory executor nodes. Most OLTP applications are small enough to fit entirely in memory. This allowed the DBMS to use architectural components that were more lightweight because they did not assume that the system would ever have to stall to read data from disk. The database is partitioned into disjoint subsets, each assigned to a single-threaded execution engine that is pinned to one and only one core on a node. Each engine has exclusive access to all the data in its partition. Because it is single-threaded, only one transaction at a time can access the data stored at its partition. Thus, there are no logical locks or low-level latches in the system, and no transaction will stall waiting for another transaction once it is started. This also means that all transactions had to execute as stored procedures to avoid delays due to network round trips between the DBMS and the application.
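
The execution model described above can be sketched in a few lines of Python. Everything here (the hash partitioning, the `deposit` procedure) is invented for illustration; the real H-Store routes pre-declared stored procedures to single-threaded engine threads.

```python
# Toy model of H-Store-style partitioning: each partition is owned by
# exactly one engine, so no locks or latches are needed as long as a
# transaction touches only its own partition.
NUM_PARTITIONS = 4
partitions = [dict() for _ in range(NUM_PARTITIONS)]

def partition_for(key):
    # Hash partitioning across the disjoint subsets.
    return hash(key) % NUM_PARTITIONS

def execute_single_partition(key, procedure):
    # The owning engine runs the whole stored procedure to completion;
    # nothing else can touch this partition concurrently.
    p = partitions[partition_for(key)]
    return procedure(p, key)

def deposit(store, key):
    # A hypothetical stored procedure: add 100 to an account balance.
    store[key] = store.get(key, 0) + 100
    return store[key]

print(execute_single_partition("acct-1", deposit))  # 100
print(execute_single_partition("acct-1", deposit))  # 200
```

Because each engine owns its partition outright, the "concurrency control" for single-partition work is simply the serial order of that engine's queue.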

Another idea in the original design of H-Store was classifying transactions into different groups based on what data they accessed during execution. A single-sited transaction (later relabeled as single-partition) was one that only accessed data at a single partition. This was the ideal scenario for a transaction under the H-Store model as it did not require any coordination between partitions. It requires that the application’s database be partitioned in such a way that all the data that is used together in a transaction reside in the same partition (called a constrained tree schema in the original H-Store paper [Kallman et al. 2008]). Another transaction type, called one shot, is where a transaction is decomposed into multiple single-partition transactions that do not need to coordinate with each other. The last type, known as a general transaction, is when the transaction accesses an arbitrary number of partitions. These transactions can contain either a single query that accesses multiple partitions or multiple queries that each access disparate partitions. General transactions are the worst-case scenario for the H-Store model.
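
These three classes can be modeled as a classifier over the sets of partitions that a transaction's queries touch. This is a sketch under simplifying assumptions: H-Store derives this information from the stored procedure and the database design, not from explicit partition sets like these.

```python
def classify(queries):
    """Classify a transaction by the partitions its queries access.
    Each query is modeled as the set of partitions it touches."""
    all_parts = set().union(*queries)
    if len(all_parts) == 1:
        return "single-partition"   # ideal: no coordination needed
    if all(len(q) == 1 for q in queries):
        return "one-shot"           # decomposes into independent pieces
    return "general"                # worst case for the H-Store model

print(classify([{0}, {0}]))       # single-partition
print(classify([{0}, {1}]))       # one-shot
print(classify([{0, 1}, {2}]))    # general
```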

First Prototype (2006)

The first version of the H-Store system was a prototype built by Daniel Abadi and Stavros Harizopoulos with Sam Madden and Stonebraker at MIT for their VLDB 2007 paper “The End of an Architectural Era: (It’s Time for a Complete Rewrite)” [Stonebraker et al. 2007b]. This early system was built only to execute a hardcoded version of TPC-C over arrays of data. It used a simple B+Tree for indexes. It did not support any logging or SQL.

The MIT H-Store prototype was able to achieve 82× better throughput than an Oracle installation tuned by a professional. This was evidence that H-Store’s design of eschewing the legacy architecture components of disk-oriented systems was a promising approach for OLTP workloads.

Second Prototype (2007–2008)

Given the success of the first prototype, Mike decided to continue with the H-Store idea and build a new, full-featured system as part of a collaboration among MIT, Brown, and Yale (which Dan had since joined as a new faculty member). This was when I had just started graduate school and gotten involved in the project. The first H-Store meeting was held at MIT in November 2007.

My recollection from that first meeting was that Mike “held court” in what I now understand is his typical fashion. He leaned back in his chair with his long legs stretched out and his hands placed behind his head. He then laid out his entire vision for the system: what components that we (i.e., the students) needed to build and how we should build them. As a new graduate student, I thought that this was gospel and proceeded to write down everything that he said.

One thing that still stands out for me even to this day about this meeting was how Mike referred to DBMS components by the names of their inventors or paper authors. For example, Mike said that we should build a “Selinger-style” query optimizer and that we should avoid using a “Mohan-style” recovery scheme. This was intimidating: I knew of these concepts from my undergraduate database course but had never read the original papers and thus did not know who the concepts’ inventors were. I made sure after that meeting that I read all the papers that he mentioned.

Our team of Brown and MIT students started building this second version of the system in late 2007. There was nothing that we could reuse from the original prototype since it was a proof-of-concept (i.e., it was hardcoded to only execute TPC-C transactions and store all tuples in long arrays), so we had to write the entire system from scratch. I was tasked with building the in-memory storage manager and execution engine. Hideaki Kimura was an M.S. student at Brown (later a Ph.D. student) who was helping me. Evan Jones was also a new Ph.D. student at MIT who joined the team to implement the system’s networking layer.

There were several meetings between the Brown and MIT contingents in these early months of the project. I remember that there were multiple times when the students and other professors would squabble about certain design details of the system. Mike would remain silent during these discussions. Then, after some time, he would interrupt with a long speech that started with the phrase “seems to me….” He would proceed to clarify everything and get the meeting back on track. Mike has this great ability to cut through complex problems and come up with a pithy solution. And he was correct almost every time.

To help reduce the amount of code that we had to write, Mike and the other professors suggested that we try to borrow components from other open-source DBMSs. Hideaki and I looked at SQLite, MySQL, and Postgres. For reasons that I do not remember, we were leaning toward using MySQL. I remember that I talked to somebody at the inaugural New England Database Day at MIT about my plans to do this. This person then later wrote an impassioned blog article where he pleaded for us not to use MySQL due to certain issues with its design. Given this, we then decided to borrow pieces from Postgres. Our plan was to use Postgres to parse the SQL queries and then extract the query plans. We would then execute those plans in our engine (this was before Postgres’ Foreign Data Wrappers). Dan Abadi had an M.S. student at Yale implement the ability to dump out a Postgres plan to XML. As I describe below, we would later abandon the idea of using Postgres code in H-Store.

In March 2008, Evan and I wrote the H-Store VLDB demo paper. We still did not have a fully functioning system at that point. The single screenshot in that paper was a mock-up of a control panel for the system that we never ended up implementing. Around this time, John Hugg was hired at Vertica to start building the commercial version of H-Store (see Chapter 28). This was originally called Horizontica. John was building a Java-based front-end layer. He modified the HSQLDB (Hyper SQL Database) DBMS to emit XML query plans, and then he defined the stored procedure API. He did not have an execution engine.

The VLDB demo paper got accepted in late Spring 2008 [Kallman et al. 2008]. At this point we still only had separate components that were not integrated. Mike said something to the effect that given that the paper was going to be published, we had better build the damn thing. Thus, it was decided that the H-Store academic team would join forces with the Horizontica team (which at this point was just John Hugg and Bobbi Heath). We mashed together John’s Java layer in Horizontica with our H-Store C++ execution engine. We ended up not using Evan’s C++ networking layer code.

There were several people working on the system during Summer 2008 to prepare for the VLDB demo in New Zealand at the end of August. John, Hideaki, and I worked on the core system. Hideaki and I were hired as Vertica interns to deal with IP issues, but we still worked out of our graduate student offices at Brown. Evan worked with an undergrad at MIT to write the TPC-C benchmark implementation. Bobbi Heath was brought in as an engineering manager for the team; she was a former employee at Mike’s previous startup (StreamBase). Bobbi eventually hired Ariel Weisberg and Ryan Betts2 to help with development, but that was later in the year.

By the end of the summer, we had a functioning DBMS that supported SQL and stored procedures using a heuristic-based query optimizer (i.e., not the “Selinger-style” optimizer that Mike wanted!). John Hugg and I were selected to attend the conference to demonstrate the H-Store system. Although I do not remember whether the system at this time could support larger cluster sizes, our demo had only two nodes because we had to fly with the laptops to New Zealand. The conference organizers sent an email about a month before the conference confirming that each demo attendee would be provided a large screen. Before this, we had not decided what we were going to show in the demo. John and I quickly built a simple visualization tool that would show a real-time speedometer of how many transactions per second the system could execute.

I remember that John and I were debugging the DBMS the night before the demo in his hotel room. We got such a thrill when we were finally able to get the system to run without crashing for an extended period. I believe that our peak throughput was around 6,000 TPC-C transactions per second. This certainly does not sound like a lot by today’s standards, but back then MySQL and Oracle could do about 300 and 800 transactions per second, respectively. This was fast enough that the laptops would run out of memory in about a minute because TPC-C inserts a lot of new records into the database. John had to write a special transaction that would periodically go through and delete old tuples.

VoltDB (2009–Present)

After the successful VLDB demo, we had a celebration dinner in September 2008. It was here that Mike announced that they were going to form a new company to commercialize H-Store. John Hugg and the company engineers forked the H-Store code and set about removing the various hacks that we had for the VLDB demo. I remember that they had a long discussion about what to name the new system. They hired a marketing firm. I think the first name they came up with was “The Sequel”. Supposedly everyone hated it except for Mike. He thought that it would be a good punny jab at Oracle. Then they hired another marketing firm that came up with VoltDB (“Vertica On-Line Transaction Database”). Everyone liked this name.

My most noteworthy memory of the early VoltDB days was when I visited PayPal in San Jose with Mike, Evan, John, and Bobbi in October 2009 after the High Performance Transaction Systems conference. PayPal was interested in VoltDB because they were reaching the limits of their monolithic Oracle installation. Evan and I were there just as observers. Our host at PayPal was an ardent Mike supporter. Before the large meeting with the other engineering directors, this person went on for several minutes about how he had read all of Mike’s papers and how much he loved every word in them. Mike did not seem concerned by this in the least. I now realize that this is probably how I reacted the first time I met Mike.

H-Store/VoltDB Split (2010–2016)

After a brief hiatus to work on a MapReduce evaluation paper with Mike and DeWitt in 2009 [Stonebraker et al. 2010], I went back to work on H-Store. My original plan was to use VoltDB as the target platform for my research for the rest of my time in graduate school, as VoltDB had several people working on the system by then and it was open-source.

The enhancements to the original H-Store codebase added by the VoltDB team were merged back into the H-Store repository in the summer of 2010. Over time, various components of VoltDB have been removed and rewritten in H-Store to meet my research needs. But I ended up having to rewrite a lot of the VoltDB code because it did not do the things that I needed. Most notable was that it did not support arbitrary multi-partition transactions.

Mike was always pushing me to port my work to VoltDB during this period, but it just was not feasible for me to do this. But in 2012, Mike came back, wanting to work on H-Store for the anti-caching project [Harizopoulos et al. 2008]. We then extended the H-Store code for the elastic version of the system (E-Store [Taft et al. 2014a]) and the streaming version (S-Store [Çetintemel et al. 2014]).

Conclusion

The H-Store project ended in 2016 after ten years of development. Compared to most academic projects, its impact on the research community and the careers of young people was immense. There were many students (three undergraduates, nine master’s students, and nine Ph.D. students) who contributed to the project from multiple universities (MIT, Brown, CMU, Yale, University of California Santa Barbara). Some of the Ph.D. students went off to get faculty positions at top universities (CMU, Yale, Northwestern, University of Chicago). It also had collaborators from research labs (Intel Labs, QCRI). During this time Mike won the Turing Award.

After working on the system for a decade, I am happy to say that Mike was (almost) correct about everything he envisioned for the system’s design. More important is that his prediction that SQL and transactions are an important part of operational DBMSs was correct. When the H-Store project started, the trend was to use a NoSQL system that did not support SQL or transactions. But now almost all the NoSQL systems have switched over to SQL and/or added basic support for SQL.

1. http://dl.acm.org/citation.cfm?id=1453856.1453865 (Last accessed March 26, 2018.)

2. Ryan later went on to become the CTO of VoltDB after Mike stepped down in 2012.

20

Scaling Mountains: SciDB and Scientific Data Management

Paul Brown

Because it’s there.

—George Mallory

“Serial summiteer” isn’t an achievement many people aspire to.

The idea is simple enough. Pick a set of mountain peaks all sharing some defining characteristic—the tallest mountains on each of the seven continents, the 14,000-ft mountains of Colorado, the 282 “Munros” in Scotland, or even just the highest peaks in New England—and climb them, one after the other, until there are none left to climb. “Serial entrepreneur,” by contrast, is a label at the opposite end of the aspirational spectrum. Any individual responsible for founding a series of companies, each with its own technical and business goals, is worth at least a book. Somewhere between the relative popularity of “serial summiteer” and “serial entrepreneur” we might find “successful academic”: a career measured in terms of research projects, high-achieving graduates, and published papers. What all of these endeavors share is the need for planning, careful preparation, considerable patience, and stamina, but above all, stoic determination.

Mike Stonebraker has lived all of them. There are 48 mountains in New Hampshire over 4,000 ft in height and he’s climbed them all. He’s founded many start-up companies. And he won a Turing Award. In this chapter, we tell the story of one “journey up a mountain”: a project I had the privilege to be a member of. Our expedition was conceived as a research project investigating scientific data management but was obliged to transform itself into a commercial start-up. The climb’s not over. No one knows what the climbers will find as they continue their ascent. Yet today, the company Paradigm4 continues to build and sell the SciDB Array DBMS with considerable success.

Selecting Your Mountain1

What are men to rocks and mountains?

–Jane Austen

What makes a mountain interesting? This is a surprisingly difficult question. And to judge by the sheer number of technology startups that set off annually, each trudging to its own carefully selected base camp, opinions about what problems are worth solving vary tremendously. For some climbers, it’s the technical challenge. For others, it’s the satisfaction of exploring new terrain, or the thrill of standing on the peak. And, of course, because any expedition needs funding, for some members it’s the prospect that a glint of light on some remote mountaintop is actually a rich vein of gold.

No expedition ever really fails. Or, at least, they’re rarely a failure for everyone. Sometimes the technical challenge proves too difficult—in which case the expedition falls short of its goals—or too easy—in which case it arrives to find a summit crowded with sprawled picnickers enjoying sandwiches and sherry. Sometimes, the new terrain reveals itself to be dull, barren, and featureless—in which case the expedition finds itself in possession of something of no interest or intrinsic value. Other times, the glint of gold turns out to be just a flash of sun on snow, and even though every technical challenge has been met and in spite of the climbers’ heroism, the expedition descends dead broke.

Therefore, a successful serial entrepreneur or career academic must have very good taste in mountains. It is interesting, when reflecting on Mike’s choices over the years, just how good his taste has been. With a regularity that has become a standing joke in the data management industry, every few years a fresh face comes along with a “great and novel” idea that will upend the industry. But on examination, the “great” bits aren’t novel and the “novel” bits don’t turn out to be all that great. Mike’s career is a testament to picking the right problems. Given a choice between bolting a relational super-structure on a hierarchical or network storage layer or confronting the challenge of building something entirely new, we got Ingres. When the brave new world of object-oriented languages challenged SQL with talk of an “impedance mismatch” and promised performance magic with “pointer swizzling,” Mike went instead with adapting the basic relational model to serve new requirements by developing on the relational model’s concept of a domain to encompass new types (user-defined types—UDTs), user-defined functions (UDFs), and aggregates (UDAs). With Hadoop vendors all well along the path of abandoning their HDFS and MapReduce roots in favor of parallelism, column stores, and SQL, Mike’s vocal aversion to MapReduce now seems justified.

The interesting thing about the mountain represented by SciDB was the management of “scientific” data. As far back as the early 1990s, when Mike served as co-PI of the Sequoia 2000 Project [Stonebraker 1995], it was clear that the challenges of storing, organizing, and querying data generated by things like atmospheric computer models, satellites, and networks of sensors indicated that there were profound gaps between the data management technology on offer and what scientific analysts and end users wanted. Many science databases were degenerate “write once” tape repositories. Users were on their own when it came to data access. Data objects were assigned semantically meaningless labels, forcing users to struggle through catalog files on FTP servers to identify what they wanted. Since the 1990s, Mike’s friend, the late Jim Gray, had also devoted considerable portions of his considerable energy to the problem of scientific data management and had met with success in the design and implementation of the Sloan Digital Sky Survey (SDSS) [Szalay 2008]. SDSS had considerable impact in terms of how astronomy data was modeled and managed and indeed, on how the science of astronomy was conducted.

We stand today on the threshold of vast changes in the character of the data we are being called on to manage and the nature of what we do with it. Digital signal data—the kind of machine-generated or sensor-driven data commonly associated with the Internet of Things (IoT)—is utterly different from the human-centered, event-driven business data that drove the last great waves of DBMS technology. A modern precision medicine application brings together genomics data, biomedical imaging data, wearables data (the Medical IoT), medical monitoring data from MRIs and EKGs, and free text annotations alongside more regularly structured patient clinical data, and demographics details. But the bulk of the data and the analytic methods applied to it, the aspects of the application that utterly dominate the workload, can all be thought of as “scientific.” Modern data analysis involves generating hypotheses, modeling the data to test them, deriving actionable insights from these models, and then starting the process over again with fresh ideas. Indeed, this change has yielded an entirely new class of computer users: the “data scientist,” who employs methods that would be familiar to any working scientist to tease out insight from what the machines record.

Managing and analyzing scientific data was “interesting.” It had immediate application to scientific research. And it constituted a set of technical challenges and requirements with future commercial uses. It was a mountain worthy of an expedition.

Planning the Climb

When you’re doing mountain rescue, you don’t take a doctorate in mountain rescue; you look for someone who knows the terrain.

—Rory Stewart

You never know exactly what dangers and surprises you’re going to find on the mountain. So, it really helps to learn everything you can about it before you set out.

To understand the problems of scientific data management, the people to ask are working scientists. By the late 2000s, several ideas were coming together.

First, the coalface of science had shifted from being lab-coats-and-test-tubes-driven to becoming utterly dependent on computers for modeling, data management, and analysis. Scientific projects of any size always included a cohort of programmers who served as Sherpas.

Second, as a consequence of developments in sensor technology, the scale of data produced by the next generation of scientific projects could reasonably be characterized as an explosion. At CERN—currently the world’s largest generator of raw scientific data—the detectors generate about 100 terabytes a day and store tens of petabytes of recent history. But just one of the telescopes proposed for deployment in the next decade—the Large Synoptic Survey Telescope—will collect about 20 terabytes of raw data per night and must store every bit of it for ten years, eventually accumulating about 100 petabytes.
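The figures above are easy to sanity-check. A quick back-of-the-envelope calculation (my arithmetic, not from the chapter) shows that the raw stream alone accounts for roughly 73 PB over ten years, which makes the quoted ~100 PB accumulated total plausible once derived data products are included:

```python
# Back-of-the-envelope check of the LSST numbers quoted above.
TB = 10**12            # one terabyte, in bytes
PB = 10**15            # one petabyte, in bytes

nightly_raw = 20 * TB  # ~20 TB of raw data per night
nights = 365 * 10      # ten years of nightly observing

raw_total = nightly_raw * nights
print(raw_total / PB)  # -> 73.0 (petabytes from the raw stream alone)
```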

Third, the success of the SDSS project—which relied heavily on Microsoft SQL Server—seemed to suggest that it was possible to use general-purpose data management tools and technologies to support specialized scientific tasks. Given a choice between building something from scratch or reusing someone else’s components or even entire systems, the “lived preference” of working scientists had always been to roll their own—an expensive and risk-prone approach. Sponsors of large-scale public science projects were interested in new thinking.

All of these ideas were being explored at new conferences such as the Extremely Large Data Bases (XLDB) conference and workshop: a gathering of “real” scientists and computer scientists held at the Stanford Linear Accelerator (SLAC) facility. Through a mix of formal survey methods—interviews, presentations, panels—and informal interrogations—talking late into the night over beer and wine in the best mountaineering tradition—a group of academics and practitioners arrived at a list of what scientists required from data management technology [Stonebraker et al. 2009]. In summary, these requirements were as follows:

•  a data model and query language organized around arrays and based on the methods of linear algebra;

•  an extensible system capable of integrating new operators, data types, functions, and other algorithms;

•  bindings with new languages like Python, “R,” and MATLAB, rather than the traditional languages of business data processing;

•  no-overwrite storage to retain the full history (lineage or provenance) of the data as it evolved through multiple, time-stamped versions;

•  open source to encourage community contributions and to broaden the range of contributions as widely as possible;

•  massively parallel or cluster approach to storage and computation;

•  automatic, n-dimensional block data storage (rather than hashing or range partitioning);

•  access to in-situ data (rather than requiring all data be loaded before query);

•  integrated pipeline processing with storage and analytics; and

•  first-class support for uncertainty and statistical error.
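To make the “automatic, n-dimensional block data storage” requirement concrete, here is a minimal sketch in Python (illustrative only; the function names and the round-robin placement rule are my own, not SciDB’s actual scheme) of how an array’s cells group into fixed-shape chunks and how those chunks can be spread evenly over cluster instances:

```python
# Illustrative sketch: n-dimensional chunking of an array, with chunks
# placed on cluster instances in a balanced, deterministic way.
from collections import Counter

def chunk_coords(cell, chunk_shape):
    """Coordinates of the chunk that holds a given cell."""
    return tuple(c // s for c, s in zip(cell, chunk_shape))

def chunk_to_instance(chunk, grid_shape, n_instances):
    """Linearize the chunk grid, then round-robin over instances."""
    linear = 0
    for coord, dim in zip(chunk, grid_shape):
        linear = linear * dim + coord
    return linear % n_instances

# A 1000x1000 array stored as 100x100 chunks -> a 10x10 grid of chunks.
chunk_shape, grid_shape = (100, 100), (10, 10)

home = chunk_coords((250, 730), chunk_shape)
print(home)                                    # -> (2, 7)
print(chunk_to_instance(home, grid_shape, 4))  # -> 3

# Placement is balanced: each of 4 instances gets 25 of the 100 chunks.
counts = Counter(chunk_to_instance((i, j), grid_shape, 4)
                 for i in range(10) for j in range(10))
print(sorted(counts.values()))                 # -> [25, 25, 25, 25]
```

The point of the deterministic mapping is that any instance can compute where a chunk lives without consulting a central directory.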

In more detailed design terms, we planned to implement SciDB using the following techniques.

•  We decided to build SciDB in C/C++ using a suite of Linux open-source tools and made the code freely available. We did this to optimize for runtime performance, to make it easy to integrate with a number of freely available binary libraries that performed the mathematical “heavy lifting,” and because Linux had become the operating system of choice in scientific data management.

•  We adopted a shared-nothing storage and compute model. This is a common approach for systems that have high scalability and fault-tolerant requirements. Each SciDB node (we refer to them as instances) is a pure peer within a cluster of computers. Data in the SciDB arrays, and the computational workload applied to the data, are distributed in a balanced fashion across the nodes [Stonebraker 1986b].

•  Each instance implements a multi-threaded parsing, planning, and execution engine and a multi-version concurrency control (MVCC) approach to transactions similar to the one employed in Postgres [Stonebraker 1987].

•  We opted for a conventional, ACID-compliant distributed transaction model. Read and write operations are all globally atomic, consistent, isolated, and durable.

•  We decided to adopt a column-store approach to organizing records (each cell in a SciDB array can hold a multi-attribute record) and a pipelined or vectorized executor along the same lines as C-Store and MonetDB [Stonebraker et al. 2005a, Idreos et al. 2012].

•  To achieve distributed consensus for the transactions and to support the metadata catalog we relied initially on a single (or replicated for failover) installation of the PostgreSQL DBMS. Eventually we planned to use a Paxos [Lamport 2001] distributed consensus algorithm to allow us to eliminate this single point of failure.

•  Because the kinds of operations we needed to support included matrix operations such as matrix/matrix and matrix/vector product, as well as singular value decomposition, we adopted and built on the ScaLAPACK distributed linear algebra package [Blackford et al. 2017].

•  We decided to support Postgres style user-defined types, functions and aggregates [Rowe and Stonebraker 1987]. In addition, we decided to implement a novel mode of parallel operator extensibility that was closer to how High Performance Computing dealt with such problems.
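As a toy illustration of the column-store decision above (hypothetical Python for clarity, not SciDB’s C++ engine): each cell of an array holds a multi-attribute record, but each attribute lives in its own flat column, so an operator reads only the attribute it needs:

```python
# Hypothetical sketch of column-major storage for a 1-D array whose
# cells hold (temperature, error) records, plus a one-column operator.

class ArrayColumns:
    """Multi-attribute cells stored as parallel per-attribute columns."""
    def __init__(self):
        self.temperature = []
        self.error = []

    def append_cell(self, temperature, error):
        self.temperature.append(temperature)
        self.error.append(error)

def filter_gt(column, threshold):
    """Scan a single column; the other attributes are never read."""
    return [i for i, v in enumerate(column) if v > threshold]

a = ArrayColumns()
for t, e in [(11.5, 0.1), (13.2, 0.2), (9.8, 0.1), (14.0, 0.3)]:
    a.append_cell(t, e)

print(filter_gt(a.temperature, 12.0))  # -> [1, 3]
```

For analytic scans that touch one attribute out of many, this layout avoids dragging the unused attributes through memory, which is the usual argument for column stores.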

Our mountaintop goal was to build a new kind of DBMS: one tailored to the requirements of scientific users. We talked to a lot of scientists in an effort to understand how to get there. We took advice on what methods worked and heeded warnings about what wouldn’t. And we decided to explore some new terrain and develop new techniques. While I was working at IBM, some even grayer beards had mentioned in passing that Ted Codd had fiddled around with a matrix algebra as the basis of his abstract data model before settling on the simpler, set theoretic, Relational Model. But while this fact served as evidence as to the wisdom of our thinking, it wasn’t at all clear whether it was evidence for, or against, our strategic approach.

Expedition Logistics

Men[sic] Wanted: For hazardous journey. Small wages, bitter cold, long months of complete darkness, constant danger, safe return doubtful.

—Ernest Shackleton

By the end of 2008, the SciDB development team had a clear view of the mountain. We had agreed on our plans. Immediately before us was a straightforward, ACID transactional storage layer unified within an orthodox massively parallel compute framework. Superimposed upon this, instead of SQL tables, joins, and aggregates, we were thinking in terms of arrays, dot products, cross products, and convolutions.

Still, the path ahead wasn’t completely obvious. We deferred important decisions for later. What query language? What client APIs? What about the vast number of numerical analytic methods our customers might want? Once established on the climb, we reasoned, we would have more information on which to base our decisions. But there remained one glaring problem. Money.

Scientific data processing was notoriously characterized—and the attribution is vague; some say Jim Gray, others Michael Stonebraker—as a “zero-billion-dollar problem.” A successful pitch to venture capitalists must include the lure of lucre. What they want to hear is, “When we build it, users will come—carrying checkbooks.” SciDB, unique among Mike’s companies, seemed to be about something else. The pitch we made to potential funding partners emphasized the social and scientific importance of the expedition. Unlocking the mysteries of the universe, curing cancer, or understanding climate change: All of these noble efforts would be greatly accelerated with a tool like SciDB! It was pure research. A new data model! New kinds of queries! New kinds of requirements! Academic gold! Disappointingly, the research agencies who usually provide financial support to this kind of thing took one look at our ideas and disagreed with us about their potential merits.

Something else was afoot. In late 2008 a series of high-profile failures in the finance sector, combined with rapidly deteriorating economic conditions, caused many potential SciDB funders to pull in their horns. Early on, parties interested in backing an open-source DBMS that focused on array processing included some heavy hitters in e-commerce, bioinformatics, and finance. But once Bear Stearns and AIG died of hubris-induced cerebral hypoxia on their own respective mountains, enthusiasm for risk grew … thin. So, as we hunkered down to sit out the 12-month storm, the SciDB expedition was obliged to turn to the usual collection of “land sharks” [Stonebraker 2016].

By 2010 the clouds had cleared and SciDB had secured seed funding. But the finances were so strained that for the first few years of its life, the company consisted of Mike as CTO, an architect/chief plumber (the author), a CEO shouldering immense responsibilities, a couple of U.S.-based developers, a pick-up team of four Russians, and two or three part-time Ph.D. candidates. Our advisory board consisted of the very great and the very good—a dozen of them. Yet as the discerning and knowledgeable reader will have noted, our tiny expedition was embarking with a vast train of bag and baggage. To a first approximation, every one of the bullet points in our “techniques” list above implies about 30,000 lines of code: 20,000 to implement the functionality, another 5,000 for interfaces, and another 5,000 for testing. Such an undertaking implies about 2,000,000 lines of code. To be written by six people. In a year.

It’s one thing to be ambitious. Mountains are there to be climbed, even acknowledging that the overly optimistic risk disaster. Yet in the end … if prudence governed every human decision? There would be no adventures.

Base Camp

One may walk over the highest mountain one step at a time.

—Barbara Walters

So, we began. How the world has changed since the early days of Unix (or “open systems”) development! Where once the first order of a start-up’s business was to find a convenient office to co-locate programmers, their computers, management, and sales staff, the engineers working on SciDB were scattered across Moscow suburbs; Waltham, Massachusetts; and a New Delhi tower block. The virtual space we all shared involved check-in privileges to a source code trunk and a ticketing system, both running on a server in Texas.

Among our first tasks as we began the climb was to come up with the kind of data model that gave our working scientists what they said they wanted: something based around arrays, appropriate for the application of numerical methods like linear algebra, image filtering, and fast region selection and aggregation.

All data models begin with some notion of logical structure. We sketch the basics of the SciDB array in Figure 20.1. In SciDB, each n-dimensional array organizes data into a space defined by the array’s n dimensions. In Figure 20.1, the array A has two dimensions, I and J. An array’s dimensions each consist of an (ordered) list of integer index values. In addition, an array’s dimensions have a precedence order. For example, if an array B is declared with dimensions [I, J, K], the shape of B is determined by the order of its dimensions. So, if another array C uses the same dimensions but in a different order—for instance, C [K, I, J]—then we say the shape of B differs from that of C.

Figure 20.1  Structural outline of SciDB array data model.

From this definition you can address each cell in an array by using the array’s name and a list (vector) consisting of one index value per dimension. For example, the labeled cell in Figure 20.1 can be addressed as A [I=3, J=4] (or, more tersely, A [3,4], since the association between index values and dimensions is inferred from the order of the array’s dimensions). A cell in the three-dimensional array B would be addressed as B [I=5, J=5, K=5] (or B [5,5,5]). You can specify sub-regions of an array—and any region of an array is itself an array—by enumerating ranges along each dimension: A [I=3 to 6, J=4 to 6].
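The addressing rules above are easy to model concretely. The following Python sketch (a hypothetical toy, not SciDB's actual interface) keys each cell by an index vector whose order follows the array's dimension order, and shows that any range-selected region is itself an array:

```python
# Toy model of SciDB-style cell addressing (illustrative only, not the real API).
class ToyArray:
    def __init__(self, dims):
        self.dims = dims    # ordered dimension names, e.g. ["I", "J"]
        self.cells = {}     # {(i, j, ...): value}, keyed by index vector

    def put(self, coords, value):
        self.cells[tuple(coords)] = value

    def get(self, coords):
        # A [3,4]: the index-to-dimension association follows dimension order.
        return self.cells[tuple(coords)]

    def subarray(self, ranges):
        # A [I=3 to 6, J=4 to 6]: any region of an array is itself an array.
        out = ToyArray(self.dims)
        for coords, value in self.cells.items():
            if all(lo <= c <= hi for c, (lo, hi) in zip(coords, ranges)):
                out.put(coords, value)
        return out

A = ToyArray(["I", "J"])
A.put((3, 4), 42.0)                      # the labeled cell of Figure 20.1
A.put((7, 7), 1.0)
region = A.subarray([(3, 6), (4, 6)])    # keeps (3,4), excludes (7,7)
```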

SciDB arrays have constraint rules. Some are implicit to the model. The combination of an array name and dimension values uniquely determines a cell, and that cell cannot be addressed in any other way. Some constraints can be made explicit as part of an array’s definition. For example, the list of index values in an array’s dimensions can be limited to some range. If an application requires an array to hold 1024 × 1024-sized images as they change over time, a user might limit two of the array’s dimensions to values between 1 and 1024—suggestively naming these dimensions X and Y—and leave the last dimension—named T—unbound. SciDB would then reject any attempt to insert a cell into this array at a location outside its allowed area: say at X=1025, Y=1025, for any T. We envisioned other kinds of constraint rules: two arrays sharing a dimension, or a rule to say that an array must be dense, which would reject data where a cell within the array’s dimension space was “missing.”
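The bounded-image example can be sketched in the same toy spirit. This Python check (assumed names, not SciDB code) rejects any cell outside the declared X and Y ranges while leaving T unbound:

```python
# Illustrative sketch (not SciDB code): a bounded array that rejects cells
# outside its declared dimension ranges, as in the 1024 x 1024 image example.
class BoundedArray:
    def __init__(self, bounds):
        # bounds: per-dimension (lo, hi), or None for an unbound dimension like T.
        self.bounds = bounds
        self.cells = {}

    def insert(self, coords, value):
        for c, b in zip(coords, self.bounds):
            if b is not None and not (b[0] <= c <= b[1]):
                raise ValueError(f"cell {coords} is outside the allowed area")
        self.cells[tuple(coords)] = value

# X and Y limited to 1..1024; T unbound.
images = BoundedArray([(1, 1024), (1, 1024), None])
images.insert((1024, 1024, 99), 0.5)      # accepted: within bounds, any T
try:
    images.insert((1025, 1025, 1), 0.5)   # rejected for any T
    rejected = False
except ValueError:
    rejected = True
```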

Figure 20.2  Example functional language SciDB query.

And, as with the relational model, the SciDB array data model defines a closed algebra of operators. Simple, unary operators filter an array by dimension index range or cell attribute values. Grouping operators break an array up into (sometimes overlapping) sub-arrays and compute some aggregate per group. Binary operators combine two arrays to produce an output with a new shape and with new data. The output of one operator can become input to another, allowing endless, flexible combinations. In Figure 20.2, we illustrate what a simple SciDB query combining multiple operators looks like. The inner filter() and between() operators specify which cells of the input array are to be passed on as output, based on their logical location and the values in the cell’s attributes. The regrid() operator partitions the filtered data—now a sparse array—into 2-by-2 sized regions and computes an aggregate per region.
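As a rough illustration of how such a pipeline composes, here is plain Python standing in for AFL, with a sparse array represented as a dict of cells. The operator names mirror Figure 20.2, but the implementations are toy sketches, not SciDB's:

```python
# Toy analogues of the Figure 20.2 operators over a dict-of-cells "array".
def between(cells, lo, hi):
    # Restrict by logical position: keep cells inside the index ranges.
    return {p: v for p, v in cells.items()
            if all(l <= c <= h for c, l, h in zip(p, lo, hi))}

def filter_cells(cells, pred):
    # Restrict by attribute value.
    return {p: v for p, v in cells.items() if pred(v)}

def regrid(cells, size, agg=sum):
    # Partition into size-by-size regions and compute one aggregate per region.
    groups = {}
    for (i, j), v in cells.items():
        groups.setdefault((i // size, j // size), []).append(v)
    return {g: agg(vs) for g, vs in groups.items()}

A = {(i, j): i + j for i in range(4) for j in range(4)}        # dense 4x4 input
sparse = filter_cells(between(A, (0, 0), (3, 3)), lambda v: v % 2 == 0)
result = regrid(sparse, 2)    # one aggregate per 2x2 region of the sparse array
```

The output of each operator is itself an array (here, a dict of cells), which is what makes the algebra closed and the operators freely composable.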

The list of these array operators is extensive. The most recent SciDB community version—the code for which is made available under the Affero GPL license—ships with about 100 of them built in, while Paradigm4’s professional services group have created another 30–40 that plug into the basic framework—although they are not open source because they sometimes use proprietary third-party libraries. In fact, several SciDB users have even created their own customized and application specific extensions. Operators run the gamut from providing simple selection, projection, and grouping functionality to matrix multiplication and singular value decomposition to image processing operations such as Connected Component Labeling [Oloso et al. 2016]. Every operator is designed to exploit the shared-nothing architecture to function at massive scale. And as with relational operators, it is possible to explore a number of ways to reorganize a tree of operators to find logically equivalent but computationally more efficient access paths to answer compound queries.

It’s perhaps worth emphasizing just how novel this territory was. Since Codd’s original Relational Model from 1970, DBMS data models have tended to be rather ad hoc. They would start with a programming language idea—such as object-oriented programming’s notions of class, class hierarchy, inter-class references, and “messages” between classes—and would bolt on features such as transactional storage or data change. XML databases, to point to another example, started with a notion of a hierarchical markup language and then came up with syntax for specifying search expressions over the hierarchy—XPath and XQuery. SciDB shared with Codd’s Relational Model the idea of starting with an abstract mathematical framework—vectors, matrices, and linear algebra—and fleshed out a data model by building from these first principles.

In the same way that Postgres solved the problem of non-standard data by exploiting the neglected notion of relational domains through the provision of user-defined types, functions, and aggregates, SciDB set about solving the very practical problems of scientific data processing by revisiting the mathematical fundamentals. New mountains can necessitate new climbing methods.

Plans, Mountains, and Altitude Sickness

No plan survives contact …

—Helmuth von Moltke the Elder

After about nine months of mania we had a code base we could compile into an executable, an executable we could install as a DBMS, and a DBMS we could load data into and run queries against. Time for our first users!

We decided to focus our energies on the bioinformatics market because of the vast increase in data volumes generated by new sequencing technologies, the scientific requirement to integrate multiple lines of evidence to validate more complex systems models, and the need to provide systems software with analytic capabilities that scaled beyond the desktop tools popular with individual scientists. Plus, Cambridge, Massachusetts, is home to two of the world’s great universities, to labs operated by several of the world’s largest pharmaceutical makers and research institutions, and to any number of startups, all home to scientists and researchers seeking to understand how our genes and our environment combine and interact to make us sick or well. The underlying methods these researchers used appeared well-suited to SciDB’s data model and methods. A chromosome is an ordered list of nucleotides. Teasing out causal relationships between genetic variants and patient outcomes involves methods like extremely large-scale statistical calculations. Even minor features like SciDB’s no-overwrite storage were interesting in bioinformatics because of the need to guarantee the reproducibility of analytic query results.

SciDB had also attracted interest from a number of more “pure” science projects. We worked with a couple of National Labs on projects as diverse as spotting weakly interacting massive particles in the bottom of a mine in South Dakota, tracking stormy weather patterns in RADAR data over the USA with a NASA team, and looking for evidence of climate change in satellite images of the Brazilian rainforest. Interestingly, a pattern that emerged from these early engagements saw SciDB eagerly embraced by smaller, more poorly funded project teams. Given abundant resources scientists still preferred “rolling their own.” SciDB proved useful when resources were constrained.

It didn’t take long, working with these early adopters, to realize that once we got above base camp, the mountain we were on didn’t match the one we’d seen while approaching over distant plains.

First, an implicit assumption almost universal to the scientists we talked to held that all array data was dense. In their view, scientific data consisted of collections of rectilinear images captured at regular intervals. While it might be ragged in the sense that the right-angled corners of a satellite’s camera don’t mold exactly to the contours of a spherical planet, image data itself doesn’t have “holes.” What we found out quite quickly was that the biomedical, and even much of the “pure” science data we were handed, was actually sparse. It was characterized by gaps in space and time and position rather than continuity. Of course, the matrices derived from the sparse data and used to perform the mathematical and numerical analysis were dense, as was much of the image and sensor data.

So, we had to rethink aspects of our approach. Instead of simply breaking the data up into blocks, serializing each block and writing serialized blocks to disk, we were obliged to come up with a method for associating each serialized value with its logical position in the array. We accomplished this with a kind of bit-mask. Given a logical array, we would first generate a list of the logical positions for each cell that occurred in the data, and then compress and encode this list. Thus, for dense data, the Run Length Encoding (RLE) representation of “all cells are present” could be very terse: 48 bytes of metadata per megabyte data chunk. Alternatively, for very sparse data, the same method yielded a terse entry per valid cell. And the most common case, long runs of values with the occasional run of empty cells, also compressed well.

Making this change required inflicting considerable violence on our storage layer, executor, and operator logic. But it made SciDB equally proficient at both dense and sparse arrays. Our query language and, therefore, SciDB’s users, did not need to know which kind of array data they were dealing with.

Second, where we expected to get data from external applications already organized into matrix rows and columns, we found that the more usual organization was highly “relational.” Instead of a file that listed all values in the first row, then all values in the second, and so on, the more usual presentation was a { row #, column #, values … } file. This meant that we had to reorganize data on load into the SciDB array form. Not a conceptually difficult task. But it meant that, in addition to keeping up with our planned development schedule, we had to implement a highly efficient, massively parallel sort operation. Without additional staff.
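The load-time reorganization amounts to sorting { row #, column #, value } records into array order. A single-node Python stand-in for the massively parallel version (the record layout here is assumed for illustration):

```python
# Input arrives as "relational" {row, column, value} records rather than
# row-major matrix order.
records = [
    (2, 0, 7.0),
    (0, 1, 3.5),
    (1, 2, 9.1),
    (0, 0, 1.0),
]

# In SciDB this sort ran massively parallel across instances; a plain
# single-node sort stands in for it here.
ordered = sorted(records, key=lambda r: (r[0], r[1]))

# The sorted stream can then be packed into array (chunked) form.
cells = {(r, c): v for r, c, v in ordered}
```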

Third, we learned that the kinds of data encoding methods familiar to anyone who has worked on column-store DBMSs were a very poor substitute for more specialized encoding methods used for digital audio and video. Several of SciDB’s initial customers wanted to combine video data with sensor data from wearables. They wanted to know, for example, what physical movement in a video could be temporally associated with accelerometer changes. But when we tried to pull video data into SciDB using column-store encodings, we found ourselves bloating data volumes. Video and audio data is best managed using specialized formats that, unlike techniques such as Adaptive Huffman or RLE, pay attention to regions of data rather than a simple serialization of values. To accommodate audio and video, we were obliged to retrofit these specialized methods into SciDB by storing blocks of audio and video data separately from the array data.

These early surprises were the consequence of not talking to enough people before designing SciDB. No bioinformatics researchers were included on the list of scientists we consulted.

Another set of requirements we had not fully appreciated revolved around the questions of elasticity and high availability. In several SciDB applications, we found the data and workload scale required that we run across dozens of physical compute nodes. This, combined with the fact that many of the linear algebra routines we were being asked to run were quadratic in their computational complexity, suggested that we needed to confront the prospect of dynamically expanding compute resources and the certainty of hardware faults and failures in our design.

Adding (and subtracting) computers from a running cluster without shutting it down involves some pretty sophisticated software engineering. But over time, we were able to support users running read queries while the overall system had some physical nodes faulted out, and we were even able to add new physical nodes to a running cluster without requiring that the overall system be shut down or even quiesced. With the emergence of cloud computing, we expect such functionality will become what Mike Stonebraker refers to as “table stakes.”

Expeditions also measure progress by the thing they leave behind. We had learned about new requirements on the climb. But our aspirations exceeded our staffing. So, as we began to get feedback about what our priorities ought to be by talking to Paradigm4’s early SciDB customers, we trimmed our ambition and deferred features that were not required immediately.

First to go was lineage support. Although simple enough in principle—the combination of our no-overwrite, versioning storage layer and the way we maintained a comprehensive log of every user query gave us all the information we needed—it was difficult to nail down the precise list of requirements. For some users it would have been enough to record data loading operations and the files that served as data sources. For others there was a need to track where every cell’s value came from. Others mentioned that they would like to track the precise versions of data used in every read query to ensure that any interesting result could be reproduced. Yet provenance support was never anyone’s highest priority. Performance or stability or some analytic feature always took precedence.

We also never managed to embed probabilistic reasoning or to make managing uncertainty a first-class feature of the data model. For one user we went so far as to implement a full probability distribution function as a basic data type applied in conjunction with a very large-scale Monte Carlo calculation: 128 gigabytes of input data, a 2-gigabyte static distribution, and 10,000 simulation runs. But this was accomplished through user-defined type and function extensibility. As with provenance, supporting uncertainty was no one’s highest priority.
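A minimal sketch of that idea, with assumed names and a toy scale (the real deployment ran 10,000 simulations over gigabytes of data): the distribution is stored as a first-class value that each Monte Carlo run samples from.

```python
import random

class EmpiricalDistribution:
    """A probability distribution stored as a sample; draw() resamples from it.
    A stand-in for the user-defined type described in the text."""
    def __init__(self, samples):
        self.samples = list(samples)

    def draw(self, rng):
        return rng.choice(self.samples)

def monte_carlo(dist, runs, rng):
    # Each run draws from the stored distribution; return the mean outcome.
    total = sum(dist.draw(rng) for _ in range(runs))
    return total / runs

rng = random.Random(42)                                  # seeded for repeatability
dist = EmpiricalDistribution([0.0, 1.0, 1.0, 2.0])
estimate = monte_carlo(dist, 1000, rng)                  # should land near 1.0
```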

Our SQL-like query language was another casualty at this time. We had focused considerable effort on language bindings to client languages like “R” and Python—as these were the languages preferred by bioinformaticians, quants, and data scientists. In neither of these languages was the kind of Connection/Query/Result/Row-at-a-Time procedural interface designed for languages like COBOL, C/C++, or Java appropriate. Instead, the idea was to embed data management operations at a higher level of abstraction directly within the language. For example, “R” has a notion of a data frame, which is logically similar to a SciDB array. So, in addition to mechanisms to pull data out of a SciDB array and place it in a client-side “R” program, we tried to design interfaces that would obscure the physical location of the data frame’s data. A user might interact with an object that behaved exactly as an “R” data frame behaved. But under the covers, the client interface passed the user action through to SciDB, where a logically equivalent operation was performed—in parallel, at scale, and subject to transactional quality-of-service guarantees—before handing control back to the end user. A SQL-like language was therefore superfluous. So, we prioritized the Array Functional Language (AFL).
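That proxy pattern can be sketched as follows. All names here are hypothetical (the sketch is in Python rather than “R”, and the backend is a local stand-in for a SciDB connection), but it shows the shape of the idea: the object behaves like a local data frame while every operation is forwarded to the server.

```python
class RemoteFrame:
    """Looks like a local data frame; forwards each operation to a backend."""
    def __init__(self, backend, name):
        self.backend = backend    # stands in for a SciDB connection
        self.name = name          # server-side name of the underlying array

    def filter(self, pred):
        # Under the covers, ship a logically equivalent operation to the
        # server instead of moving the data to the client.
        return RemoteFrame(self.backend, self.backend.run_filter(self.name, pred))

    def collect(self):
        # Only pull data client-side when the user explicitly asks for it.
        return self.backend.fetch(self.name)

class FakeBackend:
    """In-memory stand-in for the server side of the interface."""
    def __init__(self, tables):
        self.tables = dict(tables)

    def run_filter(self, name, pred):
        out = f"{name}_filtered"
        self.tables[out] = [r for r in self.tables[name] if pred(r)]
        return out

    def fetch(self, name):
        return self.tables[name]

be = FakeBackend({"t": [1, 5, 10, 20]})
big = RemoteFrame(be, "t").filter(lambda x: x >= 10)   # no data moved yet
```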

Sometimes it’s the things you say “no” to that end up being critical to your success. Our engineering constraints led us to defer a large list of line-item level features: non-integer dimensions, additional data encodings, orthodox memory management, and a fully featured optimizer, among others. Given such limitations, the climb ahead looked daunting. Nevertheless, we persisted.

On Peaks

“The top of one mountain is always the bottom of another.”

—Marianne Williamson

“What a deepity!”

—Daniel Dennett

“It is not the mountains that we conquer, but ourselves.”

—Edmund Hillary

“That’s better.”

—Daniel Dennett

Six years into the project, we can look back over the ground we’ve covered with considerable satisfaction.

As of late 2017 SciDB is used in production in well over a dozen “industrial” applications, mostly in bioinformatics, but with a mixed bag of other use cases for good measure. SciDB is the engine behind public data sites like the National Institutes of Health’s 1000 Genomes Browser and the Stanford Medical School’s Global Biobank Engine. As we mentioned earlier, many of these production systems are characterized by the way SciDB proved itself the best, least expensive option—in terms of application development time and capital investment—among alternatives. What we’ve learned from our users is that there are significant hidden costs in developing applications when your “database” is really just a “big swamp of files.” SciDB applications run the gamut from very large databases containing the sequenced genomes of thousands of human beings, to applications that use video and motion sensor data to analyze fine-grained details of how different kinds of human bodies respond in different circumstances, to financial applications with very high data throughput and sophisticated analytic workloads.

But perhaps the most exciting thing, for a platform built with the intention of furthering scientific study, is the number of pure science research projects that have come to rely on SciDB as their data management tool and analytic platform. The SciDB solutions team is collaborating with researchers from institutions such as Harvard Medical School, Purdue University, NASA, and Brazil’s INPE, as well as continuing historical collaborations with national laboratories. SciDB’s unprecedented extensibility and flexibility have allowed researchers to perfect new techniques and methods and then make them available to a broader academic community [Gerhardt et al. 2015].

Mike Stonebraker was there at the beginning. He recognized the importance of the work, and the fundamental interest of the climb. Without his influence and reputation our expedition might never have even begun: Through some of the most difficult economic times of the last 50 years he managed to move us ahead until we were fortunate enough to secure the funding to start our work. Throughout the climb he has been a steady and congenial companion, an opinionated but invaluable guide, a wise and pragmatic Sherpa, and a constant reminder of the importance of just putting one foot in front of the other.

Acknowledgments

The author would like to acknowledge the invaluable and generous assistance of Marilyn Matz and Alex Poliakov, the CEO and Solutions Group Manager at Paradigm4, respectively, who read a draft of this chapter and provided memories and suggestions.

1. Chapter 7 by Michael Stonebraker is a must-read companion to this chapter, either before or after.

21

Data Unification at Scale: Data Tamer

Ihab Ilyas

In this chapter, I describe Mike Stonebraker’s latest start-up, Tamr, which he and I co-founded in 2013 with Andy Palmer, George Beskales, Daniel Bruckner, and Alexander Pagan. Tamr is the commercial realization of the academic prototype “Data Tamer” [Stonebraker et al. 2013b]. I describe how we started the academic project in 2012, why we did it, and how it evolved into one of the main commercial solution providers in data integration and unification at the time of writing this chapter.

Mike’s unique and bold vision targeted a problem that many academics had considered “solved” and still provides leadership in this area through Tamr.

How I Got Involved

In early 2012, Mike and I, with three graduate students (Mike’s students, Daniel Bruckner and Alex Pagan, and my student, George Beskales), started the data tamer project to tackle the infamous data-integration and unification problems, mainly record deduplication and schema mapping. At the time, I was on leave from the University of Waterloo, leading the data analytics group at the Qatar Computing Research Institute, and collaborating with Mike at MIT on another joint research project.

Encouraged by industry analysts, technology vendors, and the media, “big data” fever was reaching its peak. Enterprises were getting much better at ingesting massive amounts of data, with an urgent need to query and analyze more diverse datasets, and do it faster. However, these heterogeneous datasets were often accumulated in low-value “data lakes” with loads of dirty and disconnected datasets. Somewhat lost in the fever was the fact that analyzing “bad” or “dirty” data (always a problem) was often worse than not analyzing data at all—a problem now multiplied by the variety of data that enterprises wanted to analyze. Traditional data-integration methods, such as ETL (extract, transform, load), were too manual and too slow, requiring lots of domain experts (people who knew the data and could make good integration decisions). As a result, enterprises were spending an estimated 80% of their time preparing to analyze data, and only 20% actually analyzing it. We really wanted to flip this ratio.

At the time, I was working on multiple data quality problems, including data repair and expressive quality constraints [Beskales et al. 2013, Chu et al. 2013a, Chu et al. 2013b, Dallachiesa et al. 2013]. Mike proposed the two fundamental unsolved data-integration problems: record linkage (which often refers to linking records across multiple sources that refer to the same real-world entity) and schema mapping (mapping columns and attributes of different datasets). I still remember asking Mike: “Why deduplication and schema mapping?” Mike’s answer: “None of the papers have been applied in practice… We need to build it right.” Mike wanted to solve a real customer problem: integrating diverse datasets with higher accuracy and in a fraction of the time. As Mike describes in Chapter 7, this was the “Good Idea” that we needed! We were able to obtain and use data from Goby, a consumer web site that aggregated and integrated about 80,000 URLs, collecting information on “things to do” and events.

We later acquired two other real-life “use cases”: for schema integration (from pharmaceutical company Novartis, which shared its data structures with us) and for entity consolidation (from Verisk Health, which was integrating insurance claim data from 30-plus sources).

Data Tamer: The Idea and Prototype

At this point we had validated our good idea, and we were ready to move to Step Two: assembling the team and building the prototype. Mike had one constraint: “Whatever we do, it better scale!” In the next three months, we worked on integrating two solutions: (1) scalable schema mapping, led by Mike, Daniel, and Alex, and (2) record deduplication, led by George and me. Building the prototype was a lot of fun and we continuously tested against the real datasets. I will briefly describe these two problems and highlight the main challenges we tackled.

Schema Mapping. Different data sources might describe the same entities (e.g., customers, parts, places, studies, transactions, or events) in different ways and using different vocabularies and schemas (a schema of a dataset is basically a formal description of the main attributes and the type of values they can take). For example: While one source might refer to a part of a product as two attributes (Part Description and Part Number), a second source might use the terms Item Descrip and Part #, and a third might use Desc. and PN to describe the same thing (cf. Figure 21.1). Establishing a mapping among these attributes is the main activity in schema mapping. In the general case, the problem can be more challenging and often involves different conceptualizations, for example when relationships in one source are represented as entities in another, but we will not go through these here.

Figure 21.1  Schema mapping is needed to link all different columns describing the part number. (Source: Tamr Inc.)
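
To make the attribute-matching idea concrete, here is a minimal Python sketch of name-based column matching. It is illustrative only: the abbreviation table, the normalization, and the similarity measure are all invented for this example, and real schema matchers (including Tamr's) also exploit column contents, not just names.

```python
from difflib import SequenceMatcher

def normalize(name):
    """Lowercase, strip punctuation, and expand a few common abbreviations."""
    abbrev = {"descrip": "description", "desc": "description", "#": "number"}
    tokens = name.lower().replace(".", "").replace("#", " # ").split()
    return " ".join(abbrev.get(t, t) for t in tokens)

def name_similarity(a, b):
    """Similarity in [0, 1] between two normalized column names."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()

# Columns from two hypothetical sources, in the spirit of Figure 21.1
source_a = ["Part Description", "Part Number"]
source_b = ["Item Descrip", "Part #"]

for col_a in source_a:
    for col_b in source_b:
        print(f"{col_a!r} vs {col_b!r}: {name_similarity(col_a, col_b):.2f}")
```

On these columns, Part Number and Part # score as a perfect match once abbreviations are expanded, while Part Description and Item Descrip score high but below 1.0, which is exactly the kind of near-match a human reviewer would confirm.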

Most commercial schema mapping solutions (usually part of an ETL suite) traditionally focused on mapping a small number of these schemas (usually fewer than ten), and on providing users with suggested mappings taking into account similarity among columns’ names and their contents. However, as the big data stack has matured, enterprises can now easily acquire a large number of data sources and have applications that can ingest data sources as they are generated.

A perfect example is clinical studies in the pharmaceutical industry, where tens of thousands of studies/assays are conducted by scientists across the globe, often using different terminologies and a mix of standards and local schemas. Standardizing and cross-mapping collected data is essential to the companies’ businesses, and is often mandated by laws and regulations. This changed the main assumption of most schema mapping solutions: suggestions curated by users in a primarily manual process. Our main challenges were: (1) how to provide an automated solution that required reasonable interaction with the user, while being able to map thousands of schemas; and (2) how to design matching algorithms robust enough to accommodate different languages, formats, reference master data, and data units and granularity.

Figure 21.2  Many representations for the same Mike!

Record Deduplication. Record linkage, entity resolution, and record deduplication are a few terms that describe the need to unify multiple mentions or database records that describe the same real-world entity. For example, “Michael Stonebraker” information can be represented in different ways. Consider the example in Figure 21.2 (which shows a single schema for simplicity). It’s easy to see that the four records are about Mike, but they look very different. In fact, except for the typo in Mike’s name in the fourth record, all these values are correct or were correct at some point in time. While it’s easy for humans to judge if such a cluster refers to the same entity, it’s hard for a machine. Therefore, we needed to devise more robust algorithms that could find such matches in the presence of errors, different presentations, and mismatches of granularity and time references.
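
A toy version of such a record matcher can be sketched as a weighted combination of per-field string similarities. The records, fields, and weights below are hypothetical, chosen in the spirit of Figure 21.2; production matchers use far richer features and learned weights.

```python
from difflib import SequenceMatcher

def field_sim(a, b):
    """Similarity in [0, 1] between two field values, case-insensitive."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def record_similarity(r1, r2, weights):
    """Weighted average of per-field string similarities."""
    total = sum(weights.values())
    return sum(w * field_sim(r1[f], r2[f]) for f, w in weights.items()) / total

# Hypothetical variants of the same person (in the spirit of Figure 21.2)
records = [
    {"name": "Michael Stonebraker", "affiliation": "MIT"},
    {"name": "Mike Stonebraker",    "affiliation": "M.I.T."},
    {"name": "Micheal Stonebraker", "affiliation": "MIT"},  # typo in first name
]
weights = {"name": 0.7, "affiliation": 0.3}

for i in range(len(records)):
    for j in range(i + 1, len(records)):
        score = record_similarity(records[i], records[j], weights)
        print(f"record {i} vs record {j}: {score:.2f}")
```

All three pairs score well above what unrelated records would, even though no two records are textually identical, which is the behavior a robust matcher needs in the presence of typos and alternate representations.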

The problem is an old one. Over the last few decades, the research community came up with many similarity functions, supervised classifiers to distinguish matches from non-matches, and clustering algorithms to collect matching pairs in the same group. Similar to schema mapping, current algorithms can deal with a few thousand records (or millions of records, but partitioned into disjoint groups of a few thousand!). However, given the massive amount of dirty data collected—and in the presence of the aforementioned schema-mapping problem—we now faced multiple challenges, including:

1.  how to scale the quadratic problem (we have to compare every record to all other records, so computational complexity is quadratic in the number of records);

2.  how to train and build machine learning classifiers that handle the subtle similarities as in Figure 21.2;

3.  how to involve humans and domain experts in providing training data, given that matches are often rare; and

4.  how to leverage all domain knowledge and previously developed rules and matchers in one integrated tool.
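
Challenge (1), avoiding the all-pairs comparison, is classically attacked with blocking: group records by a cheap key and compare only records that share a block. The sketch below is a minimal illustration of the idea (the blocking key and records are invented; Tamr's actual pruning methods are more elaborate):

```python
from collections import defaultdict
from itertools import combinations

def blocking_key(record):
    """Cheap key: first three letters of the last name, lowercased."""
    return record["name"].split()[-1][:3].lower()

def candidate_pairs(records):
    """Generate only pairs that share a block, not all O(n^2) pairs."""
    blocks = defaultdict(list)
    for idx, rec in enumerate(records):
        blocks[blocking_key(rec)].append(idx)
    pairs = set()
    for members in blocks.values():
        pairs.update(combinations(members, 2))
    return pairs

records = [
    {"name": "Michael Stonebraker"},
    {"name": "Mike Stonebraker"},
    {"name": "Sam Madden"},
    {"name": "Andy Palmer"},
]
print(candidate_pairs(records))  # only the two Stonebraker records share a block
```

Of the six possible pairs, only one survives blocking, so the expensive similarity computation runs on a small candidate set; the trade-off is that a badly chosen key can block true matches apart.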

Mike, Daniel, and Alex had started the project focusing on schema mapping, while George and I had focused on the deduplication problem. But it was easy to see how similar and correlated these two problems were. In terms of similarity, both problems are after finding matching pairs (attributes in the case of schema mapping, records in the case of deduplication).

We quickly discovered that most building blocks we created could be reused and leveraged for both problems. In terms of correlation, most record matchers depend on some known schema for the two records they compare (in order to compare apples to apples); however, unifying schemas requires some sort of schema mapping, even if not complete.

For this and many other reasons, Data Tamer was born as our vision for consolidating these activities and devising core matching and clustering building blocks for data unifications that could: (1) be leveraged for different unification activities (to avoid piecemeal solutions); (2) scale to a massive number of sources and data; and (3) have human in the loop as a driver to guide the machine in building classifiers and applying the unification at large scale, in a trusted and explainable way.

Meanwhile, Stan Zdonik (from Brown University) and Mitch Cherniack (from Brandeis University) were simultaneously working with Alex Pagan on expert sourcing: crowdsourcing, but applied inside the enterprise and assuming levels of expertise. The idea was to use a human in the loop to resolve ambiguities when the algorithm’s confidence on a match falls below a threshold. They agreed to apply their model to the Goby data to unify entertainment events for tourists.
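
The expert-sourcing idea can be reduced to a simple routing rule: auto-decide pairs whose match probability is confidently high or low, and queue the ambiguous middle band for human experts. The thresholds, pair names, and scores below are hypothetical:

```python
def route_matches(pairs, classifier, low=0.3, high=0.8):
    """Auto-accept confident matches, auto-reject confident non-matches,
    and send everything in between to a human expert queue."""
    accepted, rejected, for_experts = [], [], []
    for pair in pairs:
        p = classifier(pair)          # probability the pair is a true match
        if p >= high:
            accepted.append(pair)
        elif p <= low:
            rejected.append(pair)
        else:
            for_experts.append(pair)  # ambiguous: ask a domain expert
    return accepted, rejected, for_experts

# Toy classifier: a precomputed score per candidate pair (hypothetical values)
scores = {("a", "b"): 0.95, ("a", "c"): 0.05, ("b", "c"): 0.5}
acc, rej, ask = route_matches(scores, lambda pair: scores[pair])
print(ask)  # only the ambiguous pair goes to the expert queue
```

Because true matches are rare, this routing keeps the human workload proportional to the ambiguous band rather than to the full candidate set, which is what makes expert sourcing feasible at enterprise scale.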

Our academic prototype worked better than the Goby handcrafted code and equaled the results from a professional service on Verisk Health data. And it appeared to offer a promising approach to curate and unify the Novartis data (as mentioned in Chapter 7).

The vision, prototype, and results were described in the paper “Data Curation at Scale: The Data Tamer System,” presented at CIDR 2013, the Sixth Biennial Conference on Innovative Data Systems Research in California [Stonebraker 2013].

The Company: Tamr Inc.

Given Mike’s history with system-building and starting companies, it wasn’t hard to see where he was going with Data Tamer. While we were building the prototype, he clearly indicated that the only way to test “this” was to take it to market and to start a VC-backed company to do so. And Mike knew exactly who would run it as CEO: his long-term friend and business partner, Andy Palmer, who has been involved with multiple Stonebraker start-ups (see Chapter 8). Their most recent collaboration at the time was the database engine start-up Vertica (acquired in 2011 by Hewlett-Packard (HP) and now part of Micro Focus).

Tamr was founded in 2013 in Harvard Square in Cambridge, Massachusetts, with Andy and the original Data Tamer research team as co-founders. The year 2013 was also when I finished my leave and went back to the University of Waterloo and George moved to Boston to start as the first full-time software developer to build the commercial Tamr product, with Daniel and Alex leaving grad school to join Tamr1 as full-time employees.

Over the years, I have been involved in a few start-ups. I witnessed all the hard work and the amount of anxiety and stress sometimes associated with raising the seed money. But things were different at Tamr: The credibility of the two veterans, Mike and Andy, played a fundamental role in a fast, solid start, securing strong backing from Google Ventures and New Enterprise Associates (NEA). Hiring a world-class team to build the Tamr product was already under way.

True to Mike’s model described in his chapter on how to build start-ups, our first customer soon followed. The problem Tamr tackled, data unification, was a real pain point for many large organizations, with most IT departments spending months trying to solve it for any given project. However, a fundamental problem with data integration and data quality is the non-trivial effort required to show return on investment in starting these large-scale projects, like Tamr. With Tamr living much further upstream (close to the silo-ed data sources scattered all over the enterprise), we worked hard to show the real benefit of unifying all the data on an enterprise’s final product or main line of business—unless the final product is the curated data itself, as in the case of one of Tamr’s early adopters, Thomson Reuters, which played a key role in the early stages of Tamr creation.

Thomson Reuters (TR), a company in which curated and high-quality business data is the business, was thus a natural early adopter of Tamr. The first deployment of Tamr software in TR focused on deduplicating records in multiple key datasets that drive multiple businesses. Compared to the customer’s in-house, rule-based record matchers, Tamr’s machine learning-based approach (which judiciously involves TR experts in labeling and verifying results) proved far superior. The quality of results matched those of human curators on a scale that would have taken humans literally years to finish [Collins 2016].

With the success of the first deployment, the first product release was shaping up nicely. Tamr officially launched in May 2014 with around 20 full-time employees (mostly engineers, of course), and a lineup of proofs of concepts for multiple organizations.

As Mike describes in Chapter 8, with TR as the “Lighthouse Customer,” Andy Palmer the “adult supervisor,” and the strong support of Google Ventures and NEA, Steps 3, 4, and 5 of creating Tamr the company were complete.

More enterprises soon realized that they faced the same problem—and business opportunity—with their data as TR. As I write this, Tamr customers include GE, HP, Novartis, Merck, Toyota Motor Europe, Amgen, and Roche. Some customers—including GE, HP, Massachusetts Mutual Insurance, and TR—went on to invest in our company through their venture-capital arms, further validating the significance of our software for many different industries.

In February 2017, the United States Patent and Trademark Office issued Tamr a patent (US9,542,412) [Tamr 2017] covering the principles underlying its enterprise-scale data unification platform. The patent, entitled “Method and System for Large Scale Data Curation,” describes a comprehensive approach for integrating a large number of data sources by normalizing, cleaning, integrating, and deduplicating them using machine learning techniques supplemented by human expertise. Tamr’s patent describes several features and advantages implemented in the software, including:

•  the techniques used to obtain training data for the machine learning algorithms;

•  a unified methodology for linking attributes and database records in a holistic fashion;

•  multiple methods for pruning the large space of candidate matches for scalability and high data volume considerations; and

•  novel ways to generate highly relevant questions for experts across all stages of the data curation lifecycle.

With our technology, our brand-name customers, our management team, our investors, and our culture, we’ve been able to attract top talent from industry and universities. In November 2015, our company was named the #1 small company to work for by The Boston Globe.

Mike’s Influence: Three Lessons Learned.

I learned a lot from Mike over the last five years collaborating with him. Here are three important lessons that I learned, which summarize his impact on me and are indicative of how his influence and leadership have shaped Tamr’s success.

Lesson 1: Solve Real Problems with Systems

A distinctive difference of Tamr (as compared to Mike’s other start-ups) is how old and well-studied the problem was. This is still the biggest lesson I learned from Mike: It doesn’t really matter how much we think the problem is solved, how many papers were published on the subject, or how “old” the subject is, if real-world applications cannot effectively and seamlessly use a system that solves the problem, it is the problem to work on. In fact, it is Mike’s favorite type of problem. Indeed, we’re proud that, by focusing on the challenge of scale and creating reusable building blocks, we were able to leverage and transfer the collective effort of the research community over the last few decades, for practical adoption by industry—including a large number of mega enterprises.

Lesson 2: Focus, Relentlessly

Mike’s influence on the type of challenges Tamr will solve (and won’t) was strong from Day One. In the early days of Tamr, a typical discussion often went as follows.

Team: “Mike, we have this great idea on how to enable Feature X using this clever algorithm Y.”

Mike (often impatiently): “Too complicated … Make it simpler … Great for Version 10 … Can we get back to scale?”

I have often measured our progress in transferring ideas to product by the version number Mike assigns to an idea for implementation! (Lower being better, of course). His impressive skill in judging the practicality and the probability of customer adoption is one of Mike’s strongest skills in guiding the construction of adoptable and truly useful products.

Lesson 3: Don’t Invent Problems. Ever

Mike simply hates inventing problems. If it isn’t somebody’s pain point, it is not important. This can be a controversial premise for many of us, especially in academia. Far too often in academia, the argument is about innovation and solutions to fundamental theoretical challenges that can open the door for new practical problems, and so on.

In identifying problems, my lesson from Mike was not to be convinced one way or another. Instead, simply take an extreme position and make the biggest tangible impact with it. Mike spends a lot of his time listening to customers, industry practitioners, field engineers, and product managers. These are Mike’s sources of challenges, and his little secret is to always look to deliver the biggest bang for the buck. As easy as it sounds, talking to this diverse set of talents, roles, and personalities is an art, requiring a good mix of experience and “soft” skills.

Watching Mike has greatly influenced the way I harvest, judge, and approach research problems, not only at Tamr but also in my research group at Waterloo. These lessons also explain the long list of his own contributions to both academia and industry that earned him computing’s highest honor.

1. The commercial name was changed from Data Tamer to Tamr, as Data Tamer had already been taken.

22

The BigDAWG Polystore System

Tim Mattson, Jennie Rogers, Aaron J. Elmore

The BigDAWG polystore system is for many of us the crowning achievement of our collaboration with Mike Stonebraker during the years of the Intel Science and Technology Center (ISTC) for Big Data. Perhaps the best way to explain this statement is to break it down into its constituent components.

Big Data ISTC

The Intel Science and Technology Center (ISTC) for Big Data was a multi-university collaboration funded over five years (2012–2017) by Intel. The idea was that certain problems are so big and so complex that they need a focused investigation free from the constraints of industry product cycles or academic NSF grant-chasing. When faced with such problems, Intel steps in and funds a group of professors over a three- to five-year period to address those problems. The research is open-IP or, in the language of industry, pre-competitive research designed to further the state of the art in a field rather than create specific products.

Big Data, whatever that pair of words means to you, clearly falls into this category of problem. In 2012, Intel worked with Sam Madden and Mike Stonebraker of MIT to launch the ISTC for Big Data. This center included research groups at MIT, the University of Washington, Brown University, Portland State University, UC Santa Barbara, and the University of Tennessee. Over time the cast of characters changed. We lost the UC Santa Barbara team and added research groups at Carnegie Mellon, Northwestern University, and the University of Chicago.

The authors of this chapter came together as part of this ISTC: one of us (Tim Mattson) as the Intel Principal Investigator (PI) for the ISTC, and the others (Jennie Rogers and Aaron Elmore) as postdocs at MIT. By the end of the project, Tim was still at Intel, but Jennie and Aaron, in part due to the success of our work in the ISTC, were assistant professors at Northwestern and the University of Chicago, respectively.

The Origins of BigDAWG

BigDAWG started as a concept in the mind of the Intel PI for the Big Data ISTC. The center was one year old when the Intel PI was drafted into that role. It was an awkward role since his background was in high-performance computing (HPC) and computational physics. Data was something other people worried about. The I/O systems on supercomputers were generally so bad that in HPC you went out of your way to pick problems that didn’t depend on lots of data. Hence, HPC people, almost by design, know little of data management.

Collaborations and how to make them work, however, is something anyone well into a research career learns. Getting professors to work together toward a common goal is an unnatural act. It happens only with deliberate focus, and to create that focus we needed a common target for everyone to rally around. It wasn’t called BigDAWG yet, but this initial seed—common Big Data solution stack into which many projects would connect—was there. In Figure 22.1, we reproduce the earliest PowerPoint slide representing what later became BigDAWG. At the top level were the visualization and applications-oriented projects. Underneath were various data stores ranging from pure storage engines to full-fledged database management systems. Underneath those data stores were math libraries tightly coupled to them to support big data analytics. And in the middle, a “narrow waist” that would create a common middleware to tie everything together.

That narrow waist was a simple API everyone could use to connect their systems together. It would, at least in the mind of the HPC guy on the team, be a simple software layer to create. We just needed a messaging-based API so the different packages would know how to connect to each other. It would take a few graduate students under the direction of a wise professor a quarter or two to pull it together.

Mike quickly picked up on how naïve the HPC guy was. Because of Mike’s experience building real systems over the years, he immediately recognized that the real contribution of this work was that “narrow waist.” Getting that right would be a project of massive scale. Hence, with each successive meeting and each successive version of this grand solution PowerPoint stack, the narrow waist grew and the other parts of the figure shrank. We eventually ended up with the picture in Figure 22.2, where the “narrow waist” had now become the API, islands, shims, and casts that dominate the system (and which will be explained later).
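
The shim idea can be caricatured in a few lines: each engine exposes a common interface to an island, which fans a query out to its member engines. The class names, the toy FILTER language, and the in-memory "engines" below are invented for illustration and are not BigDAWG's actual API:

```python
class Shim:
    """Adapter that lets one engine participate in a common query island."""
    def execute(self, query: str):
        raise NotImplementedError

class RelationalShim(Shim):
    """Wraps a toy engine: a list of dict rows standing in for a SQL store."""
    def __init__(self, table):
        self.table = table

    def execute(self, query):
        # Trivial island "language": FILTER <column>=<value>
        _, cond = query.split(" ", 1)
        col, val = cond.split("=")
        return [row for row in self.table if str(row[col]) == val]

class Island:
    """Routes one query, unchanged, to every member engine through its shim."""
    def __init__(self, shims):
        self.shims = shims

    def run(self, query):
        results = []
        for shim in self.shims:
            results.extend(shim.execute(query))
        return results

island = Island([RelationalShim([{"id": 1, "x": "a"}, {"id": 2, "x": "b"}]),
                 RelationalShim([{"id": 3, "x": "a"}])])
print(island.run("FILTER x=a"))  # matching rows gathered from both engines
```

Even this caricature shows why the waist was the hard part: a real island must translate its language into each engine's native dialect and move data between engines with different data models, which is where shims and casts earn their keep.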

Figure 22.1  The original BigDAWG concept.

Getting professors to work so closely together is akin to “herding cats.” It’s difficult if the glue holding them together is a nebulous “solution stack.” You need a name that people can connect to, and Mike came up with that name. Sam Madden, Mike, and the humble Intel PI were sitting in Mike’s office at MIT. We were well aware of the attention the Big Data group from UC Berkeley was getting from its BDAS system (pronounced “bad ass”). Mike said something to the effect “they may be bad asses, but we’re the big dog on the street.” This was offered in a tone of a friendly rivalry since most of us have long and ongoing connections to the data systems research groups at UC Berkeley. The name, however, had the right elements. It was light-hearted and laid down the challenge of what we hoped to do: ultimately create the system that would leap past current state of the art (well-represented by BDAS).

It took a while for the name to stick between the three of us. Names, however, take on a life of their own once they appear in PowerPoint. Hence, within a few ISTC meetings and a recurring PowerPoint presentation or two, the name stuck. We were creating the BigDAWG solution stack for Big Data. What exactly that meant in terms of a long-term contribution to computer science, however, wasn’t clear to anyone. That required yet another term that Mike coined: polystore.

Figure 22.2  The final BigDAWG concept.

One Size Does Not Fit All and the Quest for Polystore Systems

“One size does not fit all” embodies the idea that the structure and organization of data is so diverse that you cannot efficiently address the needs of data with a single data store. This famous slogan emerged from Mike’s career of building specialized systems optimized for a particular use case. He first explored this topic with Ugur Çetintemel in an ICDE 2005 paper [Stonebraker et al. 2005b], which later won the Test of Time Award (2015). Mike further validated this bold assertion with benchmarking results in CIDR 2007 [Stonebraker et al. 2007a].

The design of these specialized “Stonebraker Stores” favored simplicity and elegance. Mike would use the heuristic that it was not worth building a specialized store unless he thought he could get at least an order of magnitude of performance improvement over “the elephants.”1

A decade later, we have seen an explosion in specialized data management systems, each with its own performance strengths and weaknesses. Further complicating this situation, many of these systems introduced their own data models, such as array stores, graph stores, and streaming engines. Each data model had its own semantics and most had at least one domain-specific language.

At least two problems arose from this plethora of data stores. First, this encouraged organizations to scatter their data across multiple data silos with distinct capabilities and performance profiles. Second, programmers were deluged with new languages and data models to learn.

How could the “narrow waist” make these many disparate systems work together (relatively) seamlessly? How could we take advantage of the performance strengths of each data store in an ecosystem? How could we meet the data where it was and meet the programmers where they were? In an age where new data management systems are becoming available all the time, is this attempt at unifying them all a fool’s errand?

These questions dominated our work on BigDAWG, with the dream of bringing together many data management systems behind a common API. Mike coined the name “polystore” to describe what we hoped to build. At the simplest level, a polystore is a data management system that exposes multiple data stores behind a single API. We believe, however, that to fully realize the direction implied by “one size does not fit all,” we need to take the definition further: to distinguish a polystore from related concepts such as data federation. A data federation system unifies multiple database engines behind a single data model, most commonly the relational model. In the classic idea of a data federation system, however, the database engines contained in the system are completely independent and autonomous. The data federation layer basically creates a virtual single system without fundamentally changing anything inside the individual engines or how data is mapped to them.

When we use the term polystore, we refer to single systems composed of data stores that are tightly integrated. A data store may be a fully featured DBMS or a specialized storage engine (such as the TileDB array-based storage engine). Data can be moved between data stores and therefore transformed between data models attached to each engine. Queries provide location independence through an “island,” that is, a virtual data model that spans data stores. Alternatively, when a specific feature of one of the data stores is needed, a query can be directed to a particular engine. Hence, we see that the fundamental challenge of a polystore system is to balance location independence and specificity.

While we first heard the word “polystore” from Mike—who coined the term in the first place—this term is descriptive of systems people have been building for a while (as recently surveyed in [Tan et al. 2017]). Early polystore systems, such as Polybase [DeWitt et al. 2013] and Miso [LeFevre et al. 2014], focused on mixing big data systems and relational databases to accelerate analytic queries. In addition, IBM’s Garlic project [Carey et al. 1995] investigated support for multiple data models in a single federated system. The Myria project [Halperin et al. 2014] at the University of Washington is a polystore system that emphasizes location independence by making all of the data stores available through an extended relational model.

Mike took a more pragmatic view and stressed that he believed there is no “Query Esperanto.” The very reason to go with a polystore system is to expose the special features of the underlying data stores instead of limiting functionality based on the common interface. Hence, we built a query interface for BigDAWG that encapsulated the distinct query languages of each of the underlying data stores. This approach offered the union of the semantics of the underlying data stores and enabled clients to issue queries in the languages of their choice.

Putting it All Together

BigDAWG embraced the “one size does not fit all” mantra so people could benefit from the features of specialized storage engines. Choosing the right storage engines and physically integrating the systems is extremely complex. Our goal in designing BigDAWG was to simplify the lives of users and administrators without limiting the expressiveness or functionality of the data stores within BigDAWG. We introduced this novel architecture in SIGMOD Record [Duggan et al. 2015a].

We started by defining how we’d manage the abstractions specific to each class of storage engines. We did so by defining the concept of an island. An island is an abstract data model and a set of operators with which clients may query a polystore. We implemented shims to translate “island statements” into the language supported by each system. In a BigDAWG query, the user denotes the island he or she is invoking by specifying a scope. For example, a client invokes the relational scope in the following query:

RELATIONAL(SELECT avg(temperature) FROM sensor)

Scopes are composable, so one might combine the relational scope with an array scope such as the following:

ARRAY(multiply(A, RELATIONAL(SELECT avg(temperature) FROM sensor)))

An island offers location independence; in other words, a single query using the island’s model and operators returns the same answer for a given query/data pair regardless of which data store connected to the island holds the data. This may require data to move between data stores, which is accomplished through cast operators that map the storage layer of one store to that of another.
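
To make the idea of a cast concrete, here is a toy sketch of mapping data from a relational layout into an array layout. The function name and data layouts are hypothetical illustrations, not BigDAWG’s actual cast interfaces:

```python
# Toy illustration of a polystore "cast": pivoting relational rows
# (named fields) into an array-style representation (cells addressed
# by dimensions). Purely illustrative; not BigDAWG's real API.

def cast_relational_to_array(rows, dim_field, attr_field, value_field):
    """Pivot relational rows into a dict-of-dicts 'array' keyed by
    (dimension, attribute), the way an array store indexes cells."""
    array = {}
    for row in rows:
        dim = row[dim_field]
        array.setdefault(dim, {})[row[attr_field]] = row[value_field]
    return array

# Relational view: one row per (sensor, metric) measurement.
rows = [
    {"sensor": "s1", "metric": "temperature", "value": 21.5},
    {"sensor": "s1", "metric": "humidity", "value": 0.40},
    {"sensor": "s2", "metric": "temperature", "value": 19.0},
]

# Array view: cells addressed by sensor and metric.
cells = cast_relational_to_array(rows, "sensor", "metric", "value")
print(cells["s1"]["temperature"])  # -> 21.5
```

A real cast must also map types, nulls, and storage order between engines, which is where most of the engineering effort lies.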

Query Modeling and Optimization

Polystore systems execute workloads comprising diverse queries that span multiple islands. If we can find and exploit “sweet spots” in a workload, where execution is significantly faster if specialized to a particular storage engine, we can realize potentially dramatic performance benefits. In other words, we needed to build into BigDAWG a capability to model the performance characteristics of distinct storage engines and capture the strengths and weaknesses of their query processing systems.

To match queries with storage engines, we took in a user’s BigDAWG workload and observed the performance of its queries when they execute in different systems. Here our goal was to establish a set of query classes each of which would have an expected performance profile across engines. Hence, when new queries arrived, we would identify the class to which they belonged and be able to make better decisions about where they would run.

To learn a query’s class, we execute the queries in an expansive mode, a process akin to the training phase of machine learning applications. This expansive mode executes the query on all of the associated storage engines that match the query, and BigDAWG records the performance in each case. Expansive execution may be done all at once, when the query is initially submitted by the user, or opportunistically when slack resources arise in individual databases. The statistics collected during training are paired with a signature summarizing the query’s structure and the data accessed. These results are compared to other queries the system has monitored to maintain an up-to-date, dynamic representation of the performance of the system for a given query class.

Armed with these query models, BigDAWG enumerates query plans over the disparate storage engines to identify those that will deliver the highest performance. Like traditional federated databases, BigDAWG’s query optimizer represents its query plans using a directed acyclic graph. Planning polystore queries, however, is more complicated than for data federation systems since, depending on the pattern of shims and casts, BigDAWG supports engines with overlapping capabilities.

When the BigDAWG optimizer receives a query, it first parses the query to extract its signature. The planner then compares the signature to ones it has seen before and assigns it to a predicted performance profile. BigDAWG then uses this profile paired with the presently available hardware resources on each database to assign the query to one or more data stores. As BigDAWG executes a workload, it accumulates measurements about the query’s performance. This allows the BigDAWG optimizer to incrementally refine its signature classification and performance estimates. This feature of the system was particularly complicated to implement since BigDAWG supports such diverse data models with a distributed execution plan.
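
The training-and-routing loop described above can be sketched in a few lines. Everything here is a simplified stand-in: the signature function, the engine names, and the timing bookkeeping are illustrative, not BigDAWG’s actual planner:

```python
# Minimal sketch of signature-based query routing: queries reduce to a
# coarse "signature," per-engine timings are recorded per signature
# (the "expansive mode"), and new queries go to the engine with the
# best recorded average. Illustrative only.
from collections import defaultdict

def signature(query):
    # Crude signature: the set of known operators appearing in the query.
    ops = {"select", "join", "group", "multiply", "filter"}
    return frozenset(tok for tok in query.lower().split() if tok in ops)

class Router:
    def __init__(self):
        # profiles[sig][engine] -> (total_seconds, runs)
        self.profiles = defaultdict(lambda: defaultdict(lambda: (0.0, 0)))

    def record(self, query, engine, seconds):
        total, runs = self.profiles[signature(query)][engine]
        self.profiles[signature(query)][engine] = (total + seconds, runs + 1)

    def choose(self, query, default="relational"):
        timings = self.profiles.get(signature(query))
        if not timings:
            return default  # unseen query class: fall back to a default
        return min(timings, key=lambda e: timings[e][0] / timings[e][1])

router = Router()
# "Expansive mode": run the same query on every engine, record timings.
router.record("select avg(temperature) from sensor", "relational", 0.8)
router.record("select avg(temperature) from sensor", "array", 2.4)
# A new query of the same class is routed to the faster engine.
print(router.choose("select avg(pressure) from sensor"))  # -> relational
```

Calling `record` as queries continue to run is what lets the router refine its estimates incrementally, mirroring the feedback loop in the text.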

Data Movement

To effectively query among multiple storage engines, a polystore system must be able to transform and migrate data between its systems. This data movement may be temporary to accelerate some portion of a query or to leverage functionality required by the user via a cast operator. Alternatively, the move may be permanent to account for load balancing or other workload-driven optimizations. Regardless of the reason, efficient just-in-time data migration is critical for BigDAWG.

To address data migration, BigDAWG includes a data migration framework to transform data between all member storage engines. The shims and cast operators between engines and islands of information provide the migration framework and logical transformations required to change data models. BigDAWG’s migration framework is responsible for doing efficient extraction, transformation, movement, and loading of data, which is one example of a component that differentiates a polystore from federated systems. All storage engines in a BigDAWG system have a local migration agent running that listens to the query controller for when to move data. This information includes the destination engine, logical transformation rules, and required metadata about how the data is locally stored.

As most storage engines support a naïve CSV export and import functionality, the initial prototype utilized this common text-based representation to move data. However, this format requires a great deal of work to parse and convert data into the destination binary format. We explored having each engine support the ability to directly generate the destination binary format of other engines, which we found to be as much as 400% faster than the CSV-based approach. However, for a BigDAWG system that supports N database engines, writing custom connectors means writing N2 connectors for the systems, which results in significant code maintenance requirements. Instead, the migrator settled on a concise binary intermediate representation that is still 300% faster than CSV-based migration and only requires N connectors to be maintained. In a related ISTC project, researchers from the University of Washington developed a system that used program synthesis to automatically generate connectors by examining source code for CSV importers and exporters [Haynes et al. 2016].
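
The N-versus-N² connector argument is easy to see in code. In this sketch the shared intermediate representation is a plain list of tuples, and the two “engine formats” (CSV text and records as dicts) are stand-ins for real binary formats; each engine needs only one pair of to/from converters:

```python
# Sketch of migration through a shared intermediate representation (IR).
# Each "engine" writes two converters (to_ir / from_ir), so N engines
# need N connector pairs rather than N^2 pairwise translators. The
# formats here are toy stand-ins, not BigDAWG's actual binary IR.

def csv_to_ir(text):
    """Engine A export -> IR (list of tuples of strings)."""
    return [tuple(line.split(",")) for line in text.strip().splitlines()]

def ir_to_csv(rows):
    """IR -> engine A import."""
    return "\n".join(",".join(row) for row in rows)

def ir_to_dicts(rows, keys):
    """IR -> engine B import (records as dicts)."""
    return [dict(zip(keys, row)) for row in rows]

def dicts_to_ir(records):
    """Engine B export -> IR."""
    keys = sorted(records[0])
    return [tuple(r[k] for k in keys) for r in records]

# Migrate A -> B without a direct A-to-B connector:
csv_data = "s1,21.5\ns2,19.0"
ir = csv_to_ir(csv_data)                        # A -> IR
records = ir_to_dicts(ir, ["sensor", "value"])  # IR -> B
print(records[0])  # -> {'sensor': 's1', 'value': '21.5'}
```

Adding a third engine means writing one more converter pair, not a connector to every existing engine, which is exactly the maintenance argument made above.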

Significant effort went into optimizing the migration framework for efficient data transfer. We explored parallelization, SIMD (single instruction multiple data) based data transformation, lightweight compression, and adaptive ingestion when multiple methods exist for getting data into the destination system.

BigDAWG Releases and Demos2

Early in the project Mike realized that we needed a public demo as a “forcing function” to get the teams distributed across the ISTC to make progress quickly on our BigDAWG system. Many ISTC participants worked together for a demonstration at VLDB 2015 that coupled relational, array, and text-based databases for a series of workflows using a medical-based dataset (the MIMIC II dataset3).

Mike was instrumental in managing this ambitious project with a distributed and diverse team. The team outlined the initial prototype to support the demo, and a roadmap was constructed to force the march forward. Mike played a leadership role to help balance ambition with reality and made sure that unnecessary scope creep did not hinder the chances of assembling a system that was greater than its individual parts. Mike and Vijay Gadepally from MIT Lincoln Laboratory organized regular hackathons where team members from Seattle, Chicago, Providence, and Cambridge assembled at MIT to help glue the team and code together.

The experimental results from the demo system (summarized in Figure 22.3) were presented as a demo at VLDB [Elmore et al. 2015] and later published in a paper at the IEEE High Performance Extreme Computing Conference [Gadepally et al. 2016a]. This showed that leaving array data in an array-based DBMS (SciDB) and relational data in a relational DBMS (MyriaX) resulted in better performance compared to moving all the data into one DBMS or the other. This result demonstrated the benefits of the polystore concept.

For our second demo, we wanted data that was free from the privacy considerations connected to medical data. We settled on data from the Chisholm Laboratory at MIT (http://chisholmlab.mit.edu). This group works with metagenomics data collected from the ocean to understand the biology of Prochlorococcus, a tiny marine cyanobacterium responsible for 15–20% of all oxygen in Earth’s atmosphere. This data was more heterogeneous than the medical datasets in our first demo and included a new streaming data island, S-Store [Meehan et al. 2015b], to represent real-time data from a moving data collection platform. This demo was presented at the Conference on Innovative Data Systems Research (CIDR) in 2017 [Mattson et al. 2017].

Figure 22.3  Performance of a complex analytic workflow over the MIMIC II dataset showing the benefit of matching different parts of the workflow to the data store best suited to the operation/data.

With two completed demo milestones, Mike and the ISTC PI drove toward another open-source “feather in Mike’s cap.” In 2017, the ISTC released BigDAWG 0.10 to the general public [Gadepally et al. 2017] (see Chapter 31). Vijay Gadepally and Kyle O’Brien of MIT Lincoln Laboratory played a major role in integrating the components that make up BigDAWG into a coherent package (see Chapter 31). Our job moving forward is to build a community around BigDAWG and hopefully participate in its growth as researchers download the software and build on it.

Closing Thoughts

We realized several important research goals with BigDAWG. It came at a time when the storage engine landscape was extremely fractured and offered a way to unify many disparate systems in a simple-to-use interface. Mike led us to explore how to get the most out of these diverse data management offerings. BigDAWG itself also served as a common project that the members of the ISTC could rally around to integrate their work into a larger system. The two demos showed that polystores make their underlying storage engines greater than the sum of their parts. The bigger question, however, is whether BigDAWG and the polystore concept will have a long-term impact that lives beyond the ISTC, and on that count it is too early to say. We are hopeful that an open-source community will emerge around the system. It is used at Lincoln Laboratory and we hope to attract other users. To help grow the polystore community, a group of us (Vijay Gadepally, Tim Mattson, and Mike Stonebraker) have joined forces to organize a workshop series on polystore systems at the IEEE Big Data Conference.

One of the biggest problems unaddressed by BigDAWG was the quest for a Query Esperanto. We believe such a quest is worthwhile and important. The current BigDAWG query language requires that the user specify the islands for each component of a query and that the query contents match the data model for an island. This gives us maximum flexibility to take advantage of the full features of any single island. It sacrifices “location independence” and the convenience of writing queries that work regardless of how data is distributed between storage engines. We avoided this quest since we chose to build a working system first. The quest, however, is important and could have a profound impact on the usability of polystore systems.

We already mentioned the work by the Myria team at the University of Washington. They have built a working system that uses an extended relational algebra to join multiple storage engine data models. A closely related project at Lincoln Laboratory and the University of Washington is exploring how linear algebra can be used to define a unifying high-level abstraction that can unify SQL, NoSQL, and NewSQL databases [Kepner et al. 2016]. It’s too early to say if this approach will lead to useful systems, but early theoretical results are promising and point to a bright future for polystores exposed behind a common query language.

1. Mike’s term for the dominant RDBMS products.

2. For a description of the sequence of demonstrations, see Chapter 31.

3. A medical dataset available at https://physionet.org/mimic2.

23

Data Civilizer: End-to-End Support for Data Discovery, Integration, and Cleaning

Mourad Ouzzani, Nan Tang, Raul Castro Fernandez

Mike Stonebraker was intrigued by the problem of how to ease the pain data scientists face getting their data ready for advanced data analytics: namely, finding, preparing, integrating, and cleaning datasets from thousands of disparate sources. At the time, Mark Schreiber was a director of information architecture at Merck Research Laboratories, a research lab for a large pharmaceutical company where he oversaw approximately 100 data scientists. Mark told Mike that the data scientists spend 98% of their time on grunt work preparing datasets of interest and only one hour per week on useful work for running their analyses. This is well beyond the 60–80% usually reported in the literature [Brodie 2015]. In 2015, Laura Haas, who then led IBM’s Accelerated Discovery Lab, described how they addressed this problem when building specialized solutions for different customers. In addition, Mike had heard numerous similar war stories from customers of Tamr, his startup that provides solutions for curating data at scale (see Chapters 21 and 30).

This chapter describes our journey with Mike in building Data Civilizer, an end-to-end platform to support the data integration needs of data scientists and enterprise applications with components for data discovery, data cleaning, data transformation, schema integration, and entity consolidation, together with an advanced workflow system that allows data scientists to author, execute, and retrofit these components in a user-defined order. Our journey explored different scenarios, addressed different challenges, and generated cool ideas to clean, transform, and otherwise prepare data for serious data analytics, which resulted in Data Civilizer.

We Need to Civilize the Data

For some time, data in enterprise data repositories, databases, data warehouses, and data lakes have been rapidly turning into data swamps, a collection of unstructured, ungoverned, and out-of-control datasets where data is hard to find, hard to use, and may be consumed out of context.1 Enterprise and public data repositories (e.g., data.gov) are rapidly becoming Big Data swamps. For example, data swamps have many materialized views with no lineage information (i.e., multiple redundant copies without their method of generation) and are used as a place to dump data with a vague intent to do something with them in the future. In this context, the future never comes. Data owners quickly lose track of the data that goes into these swamps, causing nightmares for anyone needing to extract valuable insights from them. To convert data swamps into well-governed data repositories so that valuable insights could be discovered, we decided to build Data Civilizer, in order to civilize the data. Before exploring the requirements of such a system, we illustrate a common problem using an example representative of many that we encountered during the project.

The Day-to-Day Life of an Analyst

To determine if a particular variable is correlated with an activity of interest, you decide to use the Pearson Correlation Coefficient (PCC) to calculate that correlation. At this point, the task seems pretty obvious: just get the data for the variable and evidence of the activity and run a small PCC program on the data. In fact, you can use an algorithm in many existing libraries with no need to write your own implementation. It is not even 10 AM and you are already wondering how you will spend the rest of your day. After all, as soon as you get this PCC, you just need to write a couple of paragraphs justifying whether there is indeed a correlation or not, based on the result of the PCC analysis. It’s an easy task. Of course, at this point reality hits you. Where is the data? Where can you find the data about the variable and the activity? In principle, this seemed obvious, but now, how do you know in which of the multiple databases, lakes, and spreadsheets you can find the necessary data? You decide to get to it ASAP and call Dave, an employee who has been working with similar data in the past. Dave, who is visibly upset by your unsolicited visit, does not know where the data is, but fortunately he points you to Lisa, who according to him, has been doing “some kind of analysis” with the indicator. In your next interaction, you decide to avoid a bad look and instead pick up the phone to call Lisa. Lisa points you to exactly the place where you can find the data. After wasting a few more minutes with different passwords, you find the right credentials and access the database. After a few more minutes needed to edit the right SQL query and voilà! You found the indicator data, so you are almost done. Right? No. That was wishful thinking.

The data is incomplete. As it stands, you cannot make any statistically significant conclusions. You need to join this data with another table, but joining the two tables is not obvious. It would be obvious if you knew that the column “verse_id” is an ID that you can use in a mapping table that exists somewhere else and gives you the mapping to the “indicator_id”. This is the column you must use to join the tables. Of course, the problems do not end here. The formats of those indicators—which were dumped to the database by different people at different times and for different purposes—are different. So before using the data, one solution is to transform them to a common representation. It’s a simple transformation, but it becomes a painful process in which you must first make sure you deal with any missing values in the column, or otherwise your PCC Python program will complain. But, oh well, it is dinner time, so at this point you decide to deal with this tomorrow.
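
For concreteness, the analysis at the heart of this story is just a Pearson correlation once the grunt work is done. A minimal pure-Python version that first drops the missing values the narrator worries about might look like this (the indicator/activity data are made up for illustration):

```python
# Pearson Correlation Coefficient (PCC) over two columns, skipping
# pairs where either value is missing (None), which is exactly the
# cleaning step the narrator has to do before the "easy" analysis.
import math

def pearson(xs, ys):
    # Keep only pairs where both values are present.
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    cov = sum((x - mx) * (y - my) for x, y in pairs)
    sx = math.sqrt(sum((x - mx) ** 2 for x, _ in pairs))
    sy = math.sqrt(sum((y - my) ** 2 for _, y in pairs))
    return cov / (sx * sy)

# Hypothetical indicator and activity columns; note the missing value.
indicator = [1.0, 2.0, None, 4.0, 5.0]
activity  = [2.1, 3.9, 6.2, 8.1, 9.8]
print(round(pearson(indicator, activity), 3))  # -> 0.999
```

The two-line analysis is the easy part; finding, joining, and repairing the columns that feed it is where the week goes, which is the point of the story.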

Welcome to the unsexy, unrewarding, and complex problem of data discovery and preparation: how to find, prepare, stitch (i.e., join and integrate different datasets), and clean your data so that your simple PCC analysis can be done quickly and efficiently. It turns out that these tasks take most of the time analysts routinely spend in their day-to-day jobs. Of course, there are point solutions for some of the tasks. For example, you can definitely find outliers in a column, clean some data with custom transformations helped by software, and so on. However, the problem is that all of these stages are interconnected, and there is no tool that assists you end to end through the process and helps you understand what needs to be done next. Of course, the quality of many of the tools may be unsatisfactory and not meet your specific needs.

The Data Civilizer team, led by Mike, is looking into new ways of attacking these problems. In particular, Data Civilizer [Deng et al. 2017a, Fernandez et al. 2017a] is being developed to:

• profile datasets to discover both syntactic and semantic linkage between columns and uncover data lineage between the datasets;

•  discover datasets relevant to the task at hand—the indicator and the activity in the example above (see Chapter 33);

•  obtain access to these datasets;

•  unite datasets, put duplicate data records in clusters, and create golden records out of the clusters;

•  stitch together datasets through join paths;

•  clean datasets with a limited budget;

•  query datasets that live across different systems; and

•  use a workflow engine to compose the above components in arbitrary ways.

Each of these tasks has received considerable attention on its own, resulting in point solutions. Point solutions would help a data scientist with the data preparation task; however, an end-to-end system could better support the data scientist in solving each problem in context and benefit from synergies and optimization across the functions.

Designing an End-to-End System

If you have the sense that the example above is insurmountable, let us just say it is vastly oversimplified. How would you attack the problem? Well, as a computer scientist, you would start thinking of how to slice the large problem into smaller chunks. By defining smaller problems very well, you can come up with technically sound solutions, and even write an evaluation and publish a paper! The problem is that successfully solving the smaller problems does not necessarily lead to an effective solution to the larger problem. Mike’s approach is to attack the problem vertically—just find an end-to-end example—and at the same time, “keep it simple.” End-to-end examples with simple prototypes have been a key guiding principle. Mike started with this proposal, which then became our roadmap for the project.

To design a system, or a quick end-to-end prototype, one needs to understand the requirements first, and that requires use cases. Mike would never accept testing ideas on synthetic data because “it’s not realistic.” So, Mike’s principles include: contextualize your ideas and understand the real problems and the scope of your contributions. Based on that, our plan was to address the needs of real users, in other words, a “user’s pain,” as Mike puts it. So, Mike brought in several real-world use cases. One example came from Mark Schreiber of Merck and Faith Hill of the MIT data warehouse team. These individuals are faced with the challenge of sifting through a very large number of datasets (often in the thousands) to find data relevant to a particular task. Each must find the data of interest and then curate it, i.e., generate a coherent output dataset from this massive pool of potentially relevant source data by putting it through a curation process that involves combining related datasets, removing outliers, finding duplicates, normalizing values, and so on. Another example was a professional IT person, typified by Nabil Hachim of Novartis. He has an enterprise-wide integration task to perform. He must continuously put a collection of known datasets through a similar curation pipeline to generate a final collection of datasets for Novartis’ data scientists.

The approach of seeking external users with real problems to give feedback on working prototypes is what eventually shaped the Data Civilizer project into its current direction. Only after numerous interactions and hours of collaboration does one learn what the “highest pole of the tent” is and design the system so as to avoid surprises when the “rubber hits the road” (both popular Mike phrases).

Data Civilizer has to be designed and built so that it meets the needs of such people and “eases their pain.” These are some of the modules that we have built so far:

1.  a module to build an enterprise knowledge graph that summarizes and indexes the data as well as uncovers all possible syntactic and semantic relationships, via available ontologies;

2.  a flexible data discovery system with several queries to find relevant data and possible ways to join them;

3.  various ways to transform and clean the data by automatically discovering abbreviations and disguised missing values (i.e., default values standing in for missing data); and

4.  a user-guided module to consolidate records found to be duplicates into one canonical representation or golden record.

Data Civilizer Challenges

In this section, we discuss in some detail certain challenges we have been working on to build Data Civilizer.

The Data Transformation Challenge

When integrating data from multiple sources, there is often a need to perform different kinds of transformations. These transformations entail converting a data element from one representation to another, e.g., unit, currency, and date format conversions, or generating a semantically different but related value, e.g., airport code to city name or ISBN to book title. While some transformations can be computed via a formula, such as pounds to kilograms, others require looking up in a dictionary or other data sources. For such semantic transformations, we could not find an adequate automatic system or tool. It is clear that semantic transformations cannot be computed solely by looking at the input values and, for example, applying a formula or a string operation. Rather, the required transformations are often found in a mapping table that is either explicitly available to the application (e.g., as a dimension table in a data warehouse) or is hidden behind a transformation service or a Web form.

So, the challenge was to find sources of data that are readily available and that could help to automate this process. So, Mike said: “Why don’t you try Web tables?” Indeed, many Web tables, such as airport code to city, SWIFT code to bank, and symbol to company, may contain just the transformations we are after, either entirely or partially. So, we started working on ways to automatically discover transformations given some input and output examples.
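
The core idea can be illustrated with a toy sketch. The candidate tables and examples below are invented, and the real system's ranking and consolidation across sources is far more elaborate:

```python
# Hypothetical candidate "Web tables", each a list of (input, output) rows.
web_tables = {
    "airports": [("JFK", "New York"), ("SFO", "San Francisco"),
                 ("ORD", "Chicago"), ("BOS", "Boston")],
    "swift":    [("CHASUS33", "JPMorgan Chase"), ("BOFAUS3N", "Bank of America")],
}

def find_transformation(examples, tables):
    """Score each candidate table by how many input/output examples it
    covers, and return the name of the best-covering table."""
    best, best_score = None, 0
    for name, rows in tables.items():
        mapping = dict(rows)
        score = sum(1 for x, y in examples if mapping.get(x) == y)
        if score > best_score:
            best, best_score = name, score
    return best

# Two examples identify the airport-code table; it then transforms new inputs.
table = find_transformation([("JFK", "New York"), ("ORD", "Chicago")], web_tables)
print(table, dict(web_tables[table])["SFO"])  # prints: airports San Francisco
```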

Mike also suggested looking at how to automatically exploit Web forms, as many of them, such as currency converters, can help in the transformation task. We further extended the work to exploit knowledge bases for covering more transformations, mostly prominent head-topic transformations, such as soccer player to birthplace, soccer player to birth date, or country to head of state. Another extension was to find non-functional transformations such as books to authors and teams to players. To evaluate our tool, Mike’s idea was to simply collect transformation tasks, mostly from engineers at Tamr, and then see how much coverage we could achieve using the different sources. As it turned out, the coverage was quite high. We were able to cover 101 transformation tasks out of 120. Preliminary ideas of our tool were first described in a vision paper in CIDR 2015 [Abedjan et al. 2015b]. We then presented a full demo in SIGMOD 2015 [Morcos et al. 2015] and a full paper at ICDE 2016 [Abedjan et al. 2016b]. Most importantly, we won a best demo award in SIGMOD 2015!

The Data Cleaning Challenge

Mike had introduced us to Recorded Future, an intelligence company that monitors more than 700,000 Web sources looking for threats and other intelligence. The idea was to get some of its event data and see how clean or dirty it was and whether existing or to-be-discovered approaches could help in detecting and repairing errors in this kind of data. While Mike was convinced that the data was dirty and that something ought to be done to clean it, he was skeptical about being able to “automatically” clean the data. In fact, he threw the following challenge at us: “If you can automatically clean 50% of the data, I will license the technology for my company,” and “I believe you can clean 2% of the data.” The challenge was daunting.

Mike helped secure a three-month snapshot of data extracted by Recorded Future: about 188M JSON documents with a total size of about 3.9 TB. Each JSON document contained extracted events defined over entities and their attributes. An entity can be an instance of a person, a location, a company, and so on. Events also have attributes. In total, there were 150M unique event instances.

Looking at the data using different profilers and through eyeballing, it was clear that it contained many errors. One major observation was that some of the reported events did not fit well together when putting them on a time scale. For example, we saw that within less than an hour Barack Obama was in Italy and in South Africa. We discovered several similar cases for people traveling around, as well as for other events such as insider transactions and employment changes. To capture this kind of error, we introduced a new type of temporal dependency, namely Temporal Functional Dependencies. The key challenges in discovering such rules stem from the very nature of Web data: extracted facts are (1) sparse over time, (2) reported with delays, and (3) often reported with errors in the values because of inaccurate sources or non-robust extractors.
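
A drastically simplified version of such a temporal rule, hard-coded rather than discovered, might look like this (the events below are fabricated to mirror the example):

```python
from datetime import datetime, timedelta

# Fabricated "travel" events extracted from the Web: (entity, location, time).
events = [
    ("Barack Obama", "Italy",        datetime(2013, 7, 1, 10, 0)),
    ("Barack Obama", "South Africa", datetime(2013, 7, 1, 10, 40)),
    ("Barack Obama", "South Africa", datetime(2013, 7, 2, 9, 0)),
]

def temporal_violations(events, window=timedelta(hours=1)):
    """Flag event pairs that place the same entity in two different
    locations within the given time window."""
    violations = []
    seen = {}  # entity -> list of (location, timestamp) already processed
    for entity, loc, ts in sorted(events, key=lambda e: e[2]):
        for prev_loc, prev_ts in seen.get(entity, []):
            if prev_loc != loc and ts - prev_ts < window:
                violations.append((entity, prev_loc, loc, prev_ts, ts))
        seen.setdefault(entity, []).append((loc, ts))
    return violations

# The Italy/South Africa pair 40 minutes apart violates the rule.
print(len(temporal_violations(events)))  # prints 1
```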

Details of the actual techniques can be found in our PVLDB 2016 paper [Abedjan et al. 2015a]. More importantly, our experimental results turned out to be quite positive; we showed that temporal rules improve the quality of the data with an increase of the average precision in the cleaning process from 0.37 to 0.84, and a 40% relative increase in the average F-measure.

Continuing with the data cleaning challenge, Mike wanted to see what would really happen when the rubber hit the road with the many existing data cleaning techniques and systems, such as rule-based detection algorithms [Abedjan et al. 2015a, Chu et al. 2013a, Wang and Tang 2014, Fan et al. 2012, Dallachiesa et al. 2013, Khayyat et al. 2015]; pattern enforcement and transformation tools such as OpenRefine, Data Wrangler [Kandel et al. 2011], and its commercial descendant Trifacta, Katara [Chu et al. 2015], and DataXFormer [Abedjan et al. 2015b]; quantitative error detection algorithms [Dasu and Loh 2012, Wu and Madden 2013, Vartak et al. 2015, Abedjan et al. 2015, Prokoshyna et al. 2015]; and record linkage and de-duplication algorithms for detecting duplicate data records, such as the Data Tamer system [Stonebraker et al. 2013b] and its commercial descendant, Tamr.

So, do these techniques and systems really work when run on data from the real world? One key observation was that there was no established benchmarking for these techniques and systems using real data. So, Mike assembled a team of scientists, Ph.D. students, and postdocs from MIT, QCRI, and University of Waterloo, with each site tasked to work on one or more datasets and run one or more data cleaning tools on them. One reason for such a setting was not only the division of labor but also that some of the datasets could not be moved from one site to another due to restrictions imposed by their owners. We had several meetings, and it was imperative that the different experiments be performed in a way that made the results comparable. Mike played a great role in coordinating all of these efforts and making sure that we stayed focused to meet the deadline for the last submission for VLDB 2016. One important ingredient was a “marching order” statement from Mike at the end of each meeting on the specific tasks that needed to be accomplished within a well-defined timeframe.

A key conclusion was that there is no single dominant tool. In essence, various tools worked well on different datasets. Obviously, a holistic “composite” strategy must be used in any practical environment. This is not surprising since each tool has been designed to detect errors of a certain type. The details and results can be found in our PVLDB 2016 paper [Abedjan et al. 2016a].
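
A composite strategy can be as simple as unioning the rows flagged by each detector. The rows, rules, and thresholds below are invented for illustration:

```python
import statistics

def rule_based(rows):
    """Hypothetical rule: a ZIP code must be exactly five digits."""
    return {i for i, r in enumerate(rows)
            if not (r["zip"].isdigit() and len(r["zip"]) == 5)}

def quantitative(rows, k=10.0):
    """Flag salaries far from the median (median-absolute-deviation test)."""
    vals = [r["salary"] for r in rows]
    med = statistics.median(vals)
    mad = statistics.median([abs(v - med) for v in vals])
    return {i for i, r in enumerate(rows) if abs(r["salary"] - med) > k * mad}

def composite(rows, detectors):
    """Union the rows flagged by each detector: no single tool dominates,
    so a holistic strategy combines their outputs."""
    flagged = set()
    for detect in detectors:
        flagged |= detect(rows)
    return flagged

rows = [{"zip": "02139", "salary": 70000},
        {"zip": "2139",  "salary": 72000},      # malformed ZIP code
        {"zip": "10001", "salary": 9000000}]    # salary outlier
print(sorted(composite(rows, [rule_based, quantitative])))  # prints [1, 2]
```

Each detector catches an error the other misses, which is exactly why no single tool dominated in the benchmark.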

The Data Discovery Challenge

We say an analyst has a data discovery problem when he or she spends more time finding relevant data than solving the actual problem at hand. As it turns out, most analysts in data-rich organizations—that’s almost everybody—suffer this problem to varying degrees. The challenge is daunting for several reasons.

1.  Analysts may be interested in all kinds of data relevant to their goal. Relevant data may be waiting for them in a single relation in an often-used database, but it may also be in a CSV file copied from a siloed RDBMS and stored in a lake, or it may become apparent only after joining two other tables.

2.  Analysts may have a strong intuition about the data they need, but not always a complete knowledge of what it contains. If I want to answer: “What’s the gender gap distribution per department in my company?” I have a strong intuition of the schema I’d like to see in front of me, but I may have no clue on where to find such data, in what database, with what schema, etc.

3.  The amount of data in organizations is humongous, heterogeneous, always growing, and continuously changing. We have a use case that has on the order of 4,000 RDBMSs, another one that has a lake with 2.5 PB of data plus a few SQL Server instances, and a use case where the organization does not know exactly the amount of data it has because “it’s just split into multiple systems, but it’s a whole lot of data we have in here.” Of course, this data is always changing.

4.  Different analysts will have very different discovery needs. While an analyst in the sales department may be interested in having fresh access to all complaints made by customers, an analyst in the marketing department may want wide access to any data that could be potentially useful to assemble the features needed to build a prediction model. As part of their day-to-day jobs, analysts will have very different data discovery needs, and that will be changing continuously and naturally.

The good news is that we have built a prototype that helps in all of the above four points. In general, the reasoning is to recognize that discovery needs will change over time, and what is relevant today will not be relevant tomorrow. However, the observation is that for X to be relevant to Y, there must be a “relationship” between X and Y. So the idea is to extract all possible relationships from the data. Relationships may include aspects such as similarity of columns, similarity of schemas, functional dependencies (such as PK/FK; primary key/foreign key), and even semantic relationships. All relationships are then materialized in what we have named an “enterprise knowledge graph (EKG),” which is a graph structure that represents the relationships between data sources within an organization. This EKG is central to approaching the data discovery challenges, as we explain next, but it comes with its own challenge: it must be built first!

Aurum is a system for building, maintaining, and querying the EKG. It builds the graph by applying techniques from systems, sketching, and profiling, so as to avoid the huge scalability bottleneck one would otherwise hit when trying to read and compute complex relationships among thousands of data sources. It maintains the EKG by understanding when the underlying data changes and updating the EKG accordingly, so as to ensure it always reflects fresh data. Last, we built a collection of discovery primitives that can be composed arbitrarily to write complex data discovery queries; these primitives are, in turn, the interface analysts use to query the EKG and find relevant data.
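
One standard sketching technique for relating columns at this scale is MinHash, which estimates the Jaccard similarity of two columns from small fixed-size signatures. A toy version (not Aurum's actual implementation) might look like this:

```python
import hashlib

def minhash(values, num_hashes=128):
    """Tiny MinHash sketch: for each salted hash function, keep the minimum
    hash over the column's values. Similar columns yield similar sketches
    without ever comparing the full data."""
    return [min(int(hashlib.md5(f"{seed}:{v}".encode()).hexdigest(), 16)
                for v in values)
            for seed in range(num_hashes)]

def estimated_jaccard(sig_a, sig_b):
    """The fraction of matching sketch slots estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

col1 = {"NY", "CA", "TX", "MA", "WA"}   # a column of U.S. states
col2 = {"NY", "CA", "TX", "MA", "FL"}   # a mostly overlapping column
col3 = {"apple", "pear", "plum"}        # an unrelated column

s1, s2, s3 = minhash(col1), minhash(col2), minhash(col3)
print(estimated_jaccard(s1, s2), estimated_jaccard(s1, s3))
```

Columns with a high estimated similarity become candidate edges in the EKG, without ever scanning all column pairs in full.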

Data discovery is not a solved problem, but the approach sketched above has led us to explore more in depth the open problems—and is helping a handful of collaborators with their own discovery needs. Following Mike’s vertical approach (get an end-to-end system working early, then figure out the “highest pole in the tent,” and keep it simple) has helped us refine the biggest issues yet to solve in the larger Data Civilizer project. Having a way of discovering data with Aurum has helped us understand the challenges of joining across repositories; it has made us think hard about what to do when the data quality of two seemingly equivalent sources is different, and it has helped us to understand the importance of data transformation, which ultimately enables more opportunities for finding more relevant data.

We are just at the beginning of the road on this research, but we have built a reasonable car and are going full speed—hopefully—in the right direction!

Concluding Remarks

It is very different to comment on Mike’s contributions to well-established projects that have already had clear impacts, such as Ingres and Vertica, rather than to an ongoing project, such as Data Civilizer. However, from our five-year collaboration with Mike, we can distill what we think are the constant characteristics of this collaboration, the ones that have largely shaped the project.

1.  What was seemingly a too-big problem for a reasonably small group of collaborators ended up being manageable after finding an attack strategy. The attack was to have one or two clear, precise, end-to-end use cases early on.

2.  If you want to make the research relevant and work on problems that matter beyond academia, then base that end-to-end use case on real problems people suffer from in the wild, and only then design a quick prototype.

3.  Try out the prototype in the wild and understand what fails so that you can move forward from there.

1. http://www.nvisia.com/insights/data-swamp. Last accessed March 22, 2018.

PART VII.B

Contributions from Building Systems

24

The Commercial Ingres Codeline

Paul Butterworth, Fred Carter

Mike Stonebraker’s earliest success in the software business was Relational Technology, Inc. (RTI), later renamed Ingres Corporation, which was formed by Mike, Gene Wong, and Larry Rowe in 1980 to commercialize the Ingres research prototype.

As context, we (Paul and Fred) had no involvement with the Ingres academic research projects. Paul was brought on specifically to manage the day-to-day effort of creating the commercial Ingres product starting in 1980, and Fred joined RTI in 1982 to boost RTI’s industrial expertise in network and distributed computing.

The Ingres project produced one of the first relational databases, functioning as both a prototype/proof of concept and an operational system. The Ingres system was based on a declarative query language (QUEL) with an optimizer that was independent of the query statement itself. This was a first: for quite a while, it was the only such optimizer for relational databases. Based on the working research codeline (which had been provided to and used by various research partners), commercial Ingres delivered a relational database system that became the cornerstone of many customers’ businesses.

In his chapter “Where Good Ideas Come from and How to Exploit Them” (Chapter 10), Mike states: “Ingres made an impact mostly because we persevered and got a real system to work.” This working system was tremendously important to the commercial Ingres success. (We’ll explore this a bit more in the next section.)

Moreover, the research project continued, and from it a number of features were added to the commercial Ingres product. These include, but are not limited to, distributed Ingres (Ingres Star), user-defined types (see below), and an improved optimizer.

Mike’s continuing work, both on the research system and his work as part of RTI, allowed us to move forward quickly. We were very fortunate to have that knowledge and vision as we forged ahead.

The following sections look into some of this work in more detail.

Research to Commercial

The first commercial effort was taking the Ingres research DBMS code and converting it into a commercial product. This activity involved converting the research prototype from Unix on PDP-11s to VAX/VMS on the VAX. This conversion was done by Paul and Derek Frankforth, a truly gifted systems programmer, producing a set of VAX/VMS adaptations on which the Ingres prototype was hosted. The prototype code—bound to Unix and designed to make a large system run on PDP-11s—had to be reworked or eliminated. Since we had no Unix/PDP-11, we had no running version of the Ingres code available, making this a very interesting forensic exercise. In some cases, it was unclear what the intent of various modules was and what correct results should look like. Many times, we would get something running and then simply run queries through the system to try to figure out if what we thought the module should do was really what it did. Backtracking was not unusual! Having worked on the conversion effort, we give a huge amount of credit to the university research team because the core database code was solid. Thus, we didn’t have to worry about maintaining the correctness of the DBMS semantics.

It is interesting to note that the conversion effort and subsequent development activities were performed without direct communication with the Ingres research team, and it’s an interesting coincidence that none of the members of the research team joined RTI. The commercial Ingres effort significantly predated the contemporary notion of open source (see Chapter 12) and started with the publicly available code developed at the University of California, Berkeley. Without a known process and in an effort to make sure we were not putting anyone at the university in an awkward position, we worked without the benefit of their much deeper knowledge of the code base. It is also interesting to note the original research prototype was the only code developed by the Ingres research team used in commercial Ingres, as the two codelines quickly diverged.

Other major changes in the initial release of commercial Ingres were removing the code that drove communication in the multi-process version of Ingres that ran on the PDP-11 (since that was just extra overhead on the VAX); hardening many of the system components so that errors were dealt with in a more graceful fashion; and adding a few tools to make Ingres a more complete commercial offering.

Although not working closely with the Ingres research teams, commercial Ingres engineering benefited from continuous updates on Mike’s, Larry’s, and Gene’s research efforts, and many of those efforts were quickly implemented in commercial Ingres. Examples include: Distributed Ingres, which was introduced in the commercial product Ingres Star [Wallace 1986]; abstract datatypes,1 investigated in later Ingres research activities and Postgres, which were introduced into commercial Ingres as Universal Data Types (discussed in detail below); a simplified form of the Postgres rule system; and the development of a comprehensive suite of developer and end-user tools that made commercial Ingres attractive to the business community. Since the code bases diverged immediately, these features and others were implemented without any knowledge of the research implementations. The next few paragraphs discuss some of the synergies and surprises we encountered when incorporating research ideas into the commercial product.

Producing a Product

Once the system was running on VAX/VMS, the next problem was how to increase its commercial attractiveness. This required improving robustness, scalability, and performance, and providing the comprehensive set of tools required to make the system accessible to corporate development teams. When we first brought commercial Ingres up, we measured performance in seconds per query rather than queries per second!

Improving performance involved basic engineering work to make the system more efficient, as well as applying ongoing research from the various Ingres research efforts. Much of the improvement came from increasing code efficiency throughout the system, caching frequently used metadata and database pages, and using operating system resources more efficiently. Within two years, the efficiency of the system increased by a factor of 300 due to such engineering improvements. A number of these improvements were suggested by Mike and Larry as ideas they had previously considered as performance improvements for the research prototype but could not justify as research investments.

In addition, we leveraged ongoing research efforts to both good and bad effects. At one point the Ingres research team came up with a new version of the Ingres dynamic optimizer. We implemented the algorithm within commercial Ingres only to find that the Ingres query processing system was now so much faster that the additional overhead from the more sophisticated optimizer actually reduced the performance of many queries rather than improving it. This effort was abandoned before being released. Soon thereafter we took on the effort of converting to statistics-based query optimization, based on other Ingres research (not from Berkeley). In fact, Mike helped us recruit the author of that research, Bob Kooi, to RTI to implement a commercial version of the work. This effort was very successful: as a number of benchmarks showed, Ingres was acknowledged to have the best complex query optimization throughout the life of the company.

In contrast to DBMS performance work at the university, performance work on commercial Ingres was rarely driven by formal performance models. Much of the work involved just measuring the performance of the system, identifying the highest-cost modules, and then improving the code or eliminating the use of high-cost resources.

Lesson. From this work, we can distill a few lessons. The fact that we started with a complete and working code base made our lives much easier. As noted, we did not have to spend time worrying about the database semantics, as those were correct in the research system. Instead, we could focus on building a commercial system. Error handling, internal operations, etc., are critical. Turning the project into a commercial system involves a lot of work to ensure that recovery is always complete and consistent. As part of building any prototype, this is an important consideration. We were very fortunate to have someone with Mike’s knowledge and vision as we moved forward. Mike drove the technical agenda for Ingres, and, consequently, Ingres was recognized for a long time as the technical leader and visionary database.

Storage Structures

As Michael Carey has noted (see Chapter 15), one of the Ingres contributions was that a full storage manager could be integrated with the optimizer to improve the performance of the database manager. As we moved Ingres into the commercial world, one of the issues that we encountered involved this storage manager.

Specifically, the index structures (HASH and ISAM [indexed sequential access method]) in Ingres were static. The index structures (be they the number of hash buckets or the ISAM key structure) were fixed to the data (specifically, the keys) present at the time the relation was indexed. A HASH structure had a fixed number of buckets (based on data at the time of indexing), and an ISAM table had a fixed key tree (again, based on the key set at the time of indexing). QUEL had a command to remedy this (modify relation to ISAM), but it reorganized the entire table, locking the entire table with the corresponding impact on concurrent access.

This was acceptable in many scientific uses, but for highly concurrent and 24-hr use cases, this type of maintenance activity became increasingly difficult for customers to manage. Moreover, to newer users of the system, the static nature of these structures was not obvious. I cannot count the number of times when a user would complain about performance issues; our first question was “Did you modify to ISAM and then load the data?” (This would have created a 0-level index, with a lot of overflow pages.) The answer was often yes, and the solution was to modify the table again (that is, recreate the index). While happy to have a solution, customers were often somewhat disappointed that things were that easy. What seemed like a hard problem had a trivial fix. Of course, not all performance issues were this simple.
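
The failure mode behind that question can be sketched with a toy model. The page capacity and hash function below are hypothetical, and this is an illustration of the static-structure problem, not the Ingres implementation: rows loaded after the bucket count is fixed pile up in overflow pages, and rebuilding the structure (the "modify it again" fix) empties the chains.

```c
/* Toy model of a static HASH structure: the bucket count is fixed when
 * the table is "modified," so rows loaded afterward accumulate in
 * overflow pages. Illustrative sketch only -- page capacity and hash
 * are hypothetical, not the Ingres implementation. */
#include <stdlib.h>

#define PAGE_CAPACITY 4   /* rows per primary or overflow page */

/* Longest overflow chain after hashing nrows rows into nbuckets
 * buckets that were fixed before the load. */
int max_overflow_chain(int nrows, int nbuckets)
{
    int *rows = calloc((size_t)nbuckets, sizeof(int));
    int worst = 0;
    for (int key = 0; key < nrows; key++)
        rows[key % nbuckets]++;           /* trivial stand-in hash */
    for (int b = 0; b < nbuckets; b++) {
        /* pages needed for this bucket, rounded up */
        int pages = (rows[b] + PAGE_CAPACITY - 1) / PAGE_CAPACITY;
        int overflow = pages > 1 ? pages - 1 : 0;
        if (overflow > worst)
            worst = overflow;
    }
    free(rows);
    return worst;
}
```

With buckets sized for the data (say 4,000 rows in 1,000 buckets), every probe touches a single page; load the same rows into a structure built for a tenth of the data and each lookup wades through a long overflow chain, which is exactly what re-running modify repaired.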

To address this problem, we added a BTREE storage structure. This structure was a B+ Tree, incorporating various additions from concurrent research (R-trees, etc.). BTREEs, of course, provided a dynamic key organization that allowed users' data to shrink and grow with appropriate search performance. This was a big improvement for many types and uses of data.

That said, we did find that keeping the static data structures around was of value. For some datasets, the ability to fix the key structure was perfectly acceptable. Indeed, in some cases, these provided better performance—partly due to the nature of the data, and partly due to the lower concurrency costs because the index need not (indeed, could not) be updated.

The “Ingres Years” produced a system that functioned and had the notion of a flexible storage manager. Indices could be created for specific purposes, and the underlying keying structure made this very efficient. HASH-structured tables provided excellent query performance when the complete key was known at query time. ISAM and, later, BTREEs, did so with range queries. This flexibility provided by the variety of storage structures, extended from the research system to the commercial one, served our customers well.

Lesson. Moving from a research project (used primarily by research consumers) to a commercial system that often had very near 24-hr access requirements required us to approach software development differently. The need for the BTREE structure was a manifestation of this, requiring a product feature set change to make the Ingres database viable in some environments. As our customers increased their usage of Ingres throughout their organizations, RTI had to step up to the 24-7 use cases.

User-Defined Types

Later, we began to see that a number of customers had a need to perform queries on non-traditional data types. The reasons are well documented in the literature, but primary among them is the ability for the optimizer and/or query processors to use domain-specific information. This may take the form of data storage capabilities or domain-specific functions.

In any case, we set out to provide this capability. Again, as Michael Carey notes, ADT-Ingres had approached this problem. We looked at what they had done and incorporated similar functionality (see Chapter 15).

By this time, of course, the commercial Ingres code base had a vastly different structure from the prototype code—the code base had diverged, and the optimizer had been replaced. We needed to fully support these user-defined data types in the optimizer to fully enable performant query processing. As one might imagine, this added a certain complexity.

As with ADT-Ingres and Postgres, each user-defined type (UDT) had a name, conversions to/from external forms (typically strings of some form), and comparison functions. To provide for more generic input and error processing, there was a set of functions to aid in parsing, as well as various functions to aid in key construction and optimization.

While we needed to be able to convey information about hash functions and optimizer statistics to the Ingres runtime code, requiring them before anything would work made it a daunting task for our customers. (If our memory serves, somewhere in the neighborhood of 18 C-language functions were required to fully implement a UDT.) To support incremental development, we added the ability to restrict the functionality: UDTs could be defined that precluded keying or statistics generation for the optimizer.
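
The shape of such a registration interface can be sketched as a descriptor of function pointers. The names and fields below are hypothetical, invented for illustration; they are not the actual Ingres Object Management Extension API, and the full interface involved many more functions than shown here. Leaving the optional hooks unset corresponds to defining a UDT that precludes keying or statistics generation.

```c
/* Hedged sketch of a UDT registration descriptor. All names are
 * hypothetical -- not the actual Ingres OME interface. */
#include <stdio.h>

typedef struct udt_ops {
    const char *name;
    /* required: conversions to/from an external (string) form,
     * and a comparison function */
    int (*from_string)(const char *ext, void *internal);
    int (*to_string)(const void *internal, char *ext, size_t n);
    int (*compare)(const void *a, const void *b);
    /* optional hooks: leaving these NULL defines a UDT that precludes
     * keying / optimizer statistics, supporting incremental development */
    unsigned (*hash)(const void *internal);
    int (*stats_sample)(const void *internal);
} udt_ops;

/* Example UDT: a trivial "money" type stored internally as cents. */
int money_from_string(const char *ext, void *internal)
{
    long dollars, cents;
    if (sscanf(ext, "%ld.%2ld", &dollars, &cents) != 2)
        return -1;                        /* parse error */
    *(long *)internal = dollars * 100 + cents;
    return 0;
}

int money_compare(const void *a, const void *b)
{
    long x = *(const long *)a, y = *(const long *)b;
    return (x > y) - (x < y);
}

const udt_ops money_type = {
    .name = "money",
    .from_string = money_from_string,
    .compare = money_compare,
    /* hash and stats_sample left NULL: no keying, no statistics */
};
```

The design point is that the required subset (parse, print, compare) is enough to store and scan the type, while the optimizer- and index-facing hooks can be added later as the customer's implementation matures.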

Eventually, these capabilities became known as the Object Management Extension.

A good deal of work was done here to succinctly express the needs of the optimizer and storage manager into relatively simple things that our users (albeit the more sophisticated ones) could do without an in-depth knowledge of the intricacies of the statistics-based optimizer. The requirement here was to continue to support the high-performance, high-concurrency operations in which the commercial Ingres system was used while still providing the capabilities that our customer base required.

The model used in commercial Ingres was similar to that used in ADT-Ingres and Postgres. We definitely tried to build on the concepts, although our code and customer bases were different. It was most definitely a help to build on some of the thinking from that closely related research community.

Lesson. As Ingres moved from research to business-critical use cases, the need for reliable data access increased. Allowing Ingres users (albeit system programmers, not end users) to alter the Ingres server code presented a number of challenges, specifically with respect to data integrity. The C language, in which Ingres was implemented, did not provide much protection here. We looked at building the UDT system in a separate process, but that really wasn’t feasible for performance reasons. Consequently, we checked and protected all data access as much as possible and placed what amounted to red flashing warning lights in the documentation. Given the state of the software at the time, that was the best that could be done. In the intervening years, there have been numerous advancements in language and system design, so we might do things differently today. But writing “user extensions” is always a tradeoff. These tradeoffs must be carefully considered by anyone providing these in a commercial environment.

Conclusions

The shared beginning in separate code bases allowed us to make direct use of the ongoing research. There were always differences for many reasons, of course, but we were able to make good use of the work and adapt it to the commercial system.

Mike’s knowledge and vision helped drive the commercial effort, resulting in Ingres being recognized as the visionary leader. Mike’s leadership, in the form of the ongoing research efforts and his insight into the product direction, was tremendously important to the Ingres product and company and to us personally.

We were and are very grateful for our ongoing relationship.

Open Source Ingres

Today, Ingres is one of the world’s most widely downloaded and used open-source DBMSs. There are multiple versions (different codelines) of open-source Ingres. The Ingres Database (see below) is the original open-source Ingres, based on a software donation by Computer Associates (CA). Ingres Corporation (formerly RTI) was acquired by ASK, which was subsequently acquired by Computer Associates. After a time, CA spun out a separate enterprise for Ingres, and that enterprise was involved in a number of acquisitions and name changes including VectorWise BV, Versant, Pervasive, and ParAccel (see Wikipedia). That enterprise is now Actian. Actian released an open-source version of Ingres, called Ingres Database 10.

The following was copied from the open source web page: Ingres Database, BlackDuck OpenHub (http://openhub.net/p/ingres) on March 14, 2018. The data was subsequently deleted.

Project Summary

Ingres Database is the open source database management system that can reduce IT costs and time to value while providing the strength and features expected from an enterprise class database. Ingres Database is a leader in supporting business critical applications and helping manage the most demanding enterprise applications of Fortune 500 companies. Focused on reliability, security, scalability, and ease of use, Ingres contains features demanded by the enterprise while providing the flexibility of open source. Core Ingres technology forms the foundation, not only of Ingres Database, but numerous other industry-leading RDBMS systems.

In a Nutshell, Ingres Database …

•  has had 3,978 commits made by 74 contributors representing 3,761,557 lines of code;

•  is mostly written in C with a very well-commented source code;

•  has a well-established, mature codebase maintained by one developer with decreasing Y-O-Y commits; and

•  took an estimated 1,110 years of effort (COCOMO model) starting with its first commit in March, 2008 ending with its most recent commit about 2 years ago.

1. This major contribution of Mike’s is discussed in Chapters 1, 3, 12, and 15.

25

The Postgres and Illustra Codelines

Wei Hong

I worked on Postgres from 1989–1992, on Illustra from 1992–1997, and then on offshoots of Postgres on and off for several years after that. Postgres was such a big part of my life that I named my cats after nice-sounding names in it: Febe (Frontend-Backend, pronounced Phoebe) and Ami (Access Method Interface, pronounced Amy). I first learned RDBMS at Tsinghua University in China with the Ingres codebase in 1985. At the time, open-source software was not allowed to be released to China. Yet, my advisor and I stumbled across a boxful of line-printer printouts of the entire Ingres codebase. We painstakingly re-entered the source code into a computer and managed to make it work, which eventually turned into my master’s thesis. Most of the basic data structures in Postgres evolved from Ingres. I felt at home with Postgres code from the beginning. The impact of open-source Ingres and Postgres actually went well beyond the political barriers around the world for that era.

Postgres: The Academic Prototype

I joined Michael Stonebraker’s research group in the summer of 1989, the summer after his now-famous cross-America coast-to-coast bike trip.1 At the time, the group’s entire focus was on eliminating Lisp from the codebase. The group had been “seduced by the promise of AI,” as Mike puts it, and opted to implement Postgres in a combination of Lisp and C. The result was a horrendously slow system suffering from massive memory leaks around the language boundaries2 and unpredictable performance due to untimely garbage collection. The team was drawn to Lisp partially because of its nice development environment. However, the lack of any symbolic debugging below the Lisp/C interface (as the C object files were dynamically loaded into the running image of a stripped commercial Lisp binary) forced the team to debug with the primitive “advanced” debugger (adb) with only function names from the stack and raw memory/machine code! So, we spent the whole summer converting the Lisp to C, achieving an order-of-magnitude performance gain. I can still vividly remember the cheers in Evans Hall Room 608-3 on the UC Berkeley campus and how pleased Mike was to see the simplest PostQuel statement “retrieve (1)” work end to end in our new Lisp-free system. It was a big milestone.

One secret to Stonebraker’s success in open-source software was that he always hired a full-time chief programmer for each project. Like Ingres, Postgres was developed by groups of undergraduate part-time programmers and graduate students who ultimately wanted to publish papers on their work on the systems. The chief programmers were the key to holding the whole system together and to supporting the user base around the world through mailing lists. Postgres had groups of very talented programmers, both undergraduate and graduate students. However, they came and went, and the effort lacked consistency. Their motivations were mostly around playing with Mike’s latest and greatest workstations, hanging around cool people/projects, and/or prototyping ideas for publications.3 The chief programmers had a huge challenge on their hands. Even though the system was great for demos and academic experiments, it was nowhere near robust, reliable, or easy to use.

In 1991, in came the young chief programmer Jeff Meredith (“Quiet” in Mike’s Turing Award lecture). For whatever reasons, most of the undergraduate programmers had disappeared by then, and most of Mike’s graduate students had either graduated or were working on unrelated topics. Postgres v3.1 was not in good shape, with lots of glaring bugs. Jeff was tasked with producing v4.0 to make the system much more usable and reliable. I was recruited to help, along with fellow Stonebraker students Joe Hellerstein and Mike Olson (“Triple Rock” in Mike’s Turing lecture).

We spent many long days and nights cleaning up and rewriting many sections of flaky code, with only occasional breaks to play a primitive online game of Hearts together. The most memorable area that I fixed is in the buffer manager. At the time, Postgres suffered from major buffer page leaks because many parts of the code were careless in releasing buffers. It was a painstaking process to fix all the leaks. In the end, I was so sure that I got all the leaks plugged that I put in an error message telling people to contact me if a leak were ever detected again. I also put comments all over the code to make sure that people would follow my convention of releasing buffers or else! I don’t dare to search for my name in the Postgres codebase today. I certainly hope that no one ever saw the error message and those comments containing my name! Postgres v4.0 was finally released with much improved query semantics and overall reliability and robustness.
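
The convention being enforced can be sketched as simple pin/unpin bookkeeping with a leak check. The names and structure here are hypothetical, a minimal illustration of the idea rather than the Postgres buffer manager:

```c
/* Sketch of the buffer-release convention described above: every page
 * pinned must eventually be unpinned, and a leak check can run at the
 * end of a transaction. Hypothetical names -- not the Postgres code. */
#include <assert.h>

#define NBUFFERS 8

static int pin_count[NBUFFERS];   /* outstanding pins per buffer */

void pin_page(int buf)
{
    pin_count[buf]++;             /* caller is now responsible for release */
}

void unpin_page(int buf)
{
    assert(pin_count[buf] > 0);   /* releasing a buffer that wasn't pinned */
    pin_count[buf]--;
}

/* Number of buffers still pinned. Nonzero at end of transaction means
 * some code path forgot its release -- the class of leak fixed above. */
int leaked_pins(void)
{
    int leaks = 0;
    for (int i = 0; i < NBUFFERS; i++)
        if (pin_count[i] > 0)
            leaks++;
    return leaks;
}
```

A check like `leaked_pins()` run at a known quiet point is what turns a silent, slowly accumulating buffer leak into a loud, attributable error message.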

Stonebraker pointed out at the end of his Turing lecture that all the successful systems he created had “a collection of superstar research programmers.” I think that one of the key ingredients of Mike’s successful career is his ability to attract and retain such talent. Despite the messy codebase, mounting number of bugs, and high pressure from Mike’s usual “3 lines of code” estimates for our schedules, we managed to have a lot of fun together as post_hackers.4 Mike always took us out for beer and pizza whenever there was a milestone to celebrate or an external visitor in town. He would hang out with us for a while and then leave behind a fistful of $20 bills for us to continue “exchanging research ideas.” We fondly called these sessions “study groups.” Some of us still continue this tradition around Berkeley. The Postgres experience built lifelong friendships among us.

Illustra: “Doing It for Dollars”

By Spring 1992, Postgres was already a successful open-source5 project, with a few hundred downloads and a few thousand users around the world. We were always curious about what some of the users from Russia were doing with Postgres. There was also a user of the rule systems with rules involving Tomahawk missiles.6 We certainly hoped that it was not real!

Figure 25.1  The Miro team. Back row, left to right: Cimarron Taylor, Donna Carnes, Jeff Meredith (“Quiet”), Jim Shankland, Wei Hong (“EMP1”), Gary Morgenthaler (“Tall Shark”); middle row: Mike Ubell (“Short One”), Ari Bowes, Jeff Anton; front row: Mike Stonebraker, Paula Hawthorn (“Mom”), and Richard Emberson.

Developing Postgres aside, I also managed to make good progress on my research on parallel query processing and was getting ready to graduate. Mike was wrapping up Postgres as a research project and contemplating its commercialization. When he first approached me and Jeff Meredith about joining him in the commercialization effort, I managed to beat Jeff to Mike’s office by half an hour and became employee #1 of Illustra the company. Hence my nickname of “EMP1” in Mike’s Turing Award lecture. When faced with design decisions, Mike often lectured us on what “people who do it for dollars” would do…. We were so excited to finally get to do it for dollars ourselves!

Illustra Information Technologies, Inc., (then called Miro Systems, Inc., [Stonebraker 1993c]) was up and running in the summer of 1992.

Our main mission at the beginning of Illustra was to (1) make Postgres production-ready, (2) replace PostQuel with SQL (see Chapter 35), and (3) figure out a go-to-market plan.

1.  Productionize Postgres. Despite our best efforts as post_hackers, Postgres was nowhere near production-ready. Luckily Mike was able to recruit a few former graduate students—Paula Hawthorn and Mike Ubell (“Mom” and “Short One,” respectively, in Mike’s Turing lecture)—and chief programmer Jeff Anton from the Ingres days, who had all grown into accomplished industry veterans by then. Somehow the two generations of Stonebraker students and chief programmers were able to bond as a team immediately. The veterans played a huge role in the productization of Postgres. They instilled in us that a production database system must never, ever corrupt or lose customers’ data. They brought in tools like Purify to help eradicate memory corruptions. They fixed or rewrote critical but often neglected commands like vacuum, which could easily cause data corruption or permanent loss. They also taught us how to patch corrupted disk pages with all but dissertation when dealing with elusive heisenbugs7 that were impossible to catch. They also taught us the importance of testing. Before Illustra, Postgres was only tested by running demos or benchmarks for paper writing. When we started the first regression test suite, someone brought a chicken mask to the office. We made whoever broke the regression test wear the chicken mask for a day. This test suite became known as the “Chicken Test.” Eventually we built a substantial quality assurance team with tests ranging from synthetic coverage tests and SQL92 compliance tests to standard benchmark tests, customer-specific tests, and more. Finally, we were ready to sell Illustra as a production-worthy DBMS!

2.  SQLize Postgres. Stonebraker famously called SQL the “Intergalactic Data Speak.” It was clear from the beginning that we had to replace PostQuel with SQL. This primarily meant extending SQL92 with Postgres’ extensible type system and its support for composite types. It was not hard to extend SQL. What took us the most time was catching up on the vanilla SQL92 features that Postgres lacked. For example, integrity constraints and views were implemented by the Postgres rule system, which was not compliant with SQL92. We had to rewrite them completely from scratch. The Postgres rule systems were incredibly complex and buggy anyway. So, we “pushed them off a cliff,” as Mike would say. Other SQL features that took us a long time to complete included nested subqueries, localized strings, and date/time/decimal/numeric types. Finally, we were able to pass all SQL92 entry-level tests, which was a huge milestone.

3.  Go-to-Market. As Joe Hellerstein points out in Chapter 16, Postgres was jam-packed with great research ideas, most of which were far ahead of their time. In addition to the technical challenges, we had a huge challenge on our hands to figure out a go-to-market plan. We went over each unique feature in Postgres and discussed how to market it “for dollars”:

(a)  ADTs: a great extension to the relational model, easy for the market to digest, but sounded way too “abstract”! As researchers, we would proudly market how “abstract,” “complex,” and “extensible” our type system was, but customers needed something concrete and simple to relate to.

(b)  Rules, Active/Deductive Databases: It was an easy decision to push both Postgres rule systems “off a cliff” because this was an “AI Winter” at the time, as Joe puts it.

(c)  Time Travel. The market couldn’t care less that Postgres’ crash recovery code was much simpler thanks to Stonebraker’s no-overwrite storage system. The additional benefit of the Time Travel feature was also too esoteric for the mass-market customers. We kept the no-overwrite storage system, but no customers ever time-traveled to anywhere but NOW.

(d)  Parallel DBMS: This was near and dear to my heart because it was my dissertation. Unfortunately, the market for any form of shared-something or shared-nothing parallel DBMS was too small for a start-up to bet on at the time.

(e)  Fast Path: This was our secret weapon to combat the performance claims by OODBMS (Object-Oriented DBMS) vendors on benchmarks by bypassing the SQL overhead. Ultimately this was too low level an interface for customers to adopt. It just remained as our secret weapon to win benchmarking battles.
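The no-overwrite storage behind item (c)’s Time Travel can be illustrated with a toy sketch. This is a hedged Python analogy, not Postgres’ actual page-level implementation: updates append new versions instead of overwriting, so a query can read the value as of any earlier transaction id, with the default being NOW.

```python
# Toy no-overwrite store in the spirit of Postgres' storage system
# (illustrative only): every put() appends a new version, never
# overwriting, so old states remain queryable ("time travel").

class NoOverwriteStore:
    def __init__(self):
        self._versions = {}   # key -> list of (txn_id, value), append-only
        self._txn = 0

    def put(self, key, value):
        self._txn += 1
        self._versions.setdefault(key, []).append((self._txn, value))
        return self._txn

    def get(self, key, as_of=None):
        """Return the newest value at or before txn `as_of` (default: NOW)."""
        for txn_id, value in reversed(self._versions.get(key, [])):
            if as_of is None or txn_id <= as_of:
                return value
        return None

store = NoOverwriteStore()
t1 = store.put("balance", 100)
t2 = store.put("balance", 75)
now_value = store.get("balance")             # reads the latest version
old_value = store.get("balance", as_of=t1)   # time-travels to txn t1
```

As the chapter notes, customers never queried anywhere but NOW, but the append-only layout also made crash recovery simpler, since no committed version is ever destroyed in place.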

It became clear to us that we must capitalize on Postgres’ ADT system in our go-to-market strategy. Stonebraker quickly applied his legendary quad chart analysis (see Chapter 6, Figure 6.2) to map out the market, which put us dominantly at the upper-right quadrant (always!) as the new breed of DBMS: Object-Relational DBMS (ORDBMS), the best of both worlds of OO and Relational DBMS. It was still difficult for people to wrap their heads around such a generic system without concrete applications. We used the razor-and-blade analogy and coined the term “DataBlade,” which is a collection of data types, methods, and access methods. We knew that we must build some commercially compelling DataBlades to jump-start the market. So, in 1995 the company was reorganized into three business units: Financial, targeting Wall Street with TimeSeries DataBlade; Multimedia, targeting media companies with text and image search DataBlade; and Web, targeting the then-emerging World Wide Web. Even though there were three business units, most of the codeline development was concentrated on supporting the TimeSeries DataBlade because the company expected most of the revenues from the Financial business unit, and TimeSeries was an incredibly challenging DataBlade to implement. Doing time series well could be an entire company of its own. Wall Street customers are notoriously savvy and demanding. Even though we made inroads into most of the major firms on Wall Street, each pilot deal was hard fought by our top engineers. I did my time on Wall Street trying to optimize our system so that our performance would come close to their proprietary time series systems. I remember having to stay up all night to rewrite our external sorting module to meet some customer’s performance requirement. Despite all our heroics on Wall Street, the company had very little revenue to show for the efforts.
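A DataBlade bundled new data types, functions, and access methods into the server. As a miniature, hedged analogue (using SQLite’s user-defined-function hook in Python, not Illustra’s actual DataBlade API), one can teach a SQL engine a domain-specific operation it knows nothing about, here a great-circle distance for a hypothetical geo blade:

```python
# Hedged analogy for DataBlade-style extensibility: register a custom
# function with the engine so plain SQL queries can call it. This is
# SQLite's UDF mechanism, not Illustra's; names here are illustrative.
import math
import sqlite3

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(math.radians, (lat1, lon1, lat2, lon2))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(a))

conn = sqlite3.connect(":memory:")
conn.create_function("haversine_km", 4, haversine_km)  # extend the engine
conn.execute("CREATE TABLE cities(name TEXT, lat REAL, lon REAL)")
conn.executemany("INSERT INTO cities VALUES (?, ?, ?)",
                 [("Berkeley", 37.87, -122.27), ("Boston", 42.36, -71.06)])
(dist,) = conn.execute(
    "SELECT haversine_km(a.lat, a.lon, b.lat, b.lon) "
    "FROM cities a, cities b WHERE a.name='Berkeley' AND b.name='Boston'"
).fetchone()
```

A real DataBlade went much further, adding storage formats and index (access-method) support, but the selling point was the same: the query language stays SQL while the engine learns the domain.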

In the meantime, the Multimedia business unit limped along by partnering with text and image search vendors while the Web business unit—with only some simple data types for handling web pages and few engineers—made huge traction marketing-wise. We became the “Database for Cyberspace” and eventually were acquired by Informix.

All active development on the Illustra codeline essentially stopped after the Informix acquisition. Illustra engineers were teamed up with Informix counterparts to start a new codeline called Informix Universal Server. It was based on the Informix Dynamic Server line with extensive changes throughout the codebase to support Object-Relational features. Aside from the organizational challenges in merging teams with completely different cultures, the main technical challenge for the Illustra engineers was to go from Illustra’s multi-process single-thread environment to Informix’s homegrown, non-preemptive multithreaded environment. Despite Informix’s management and financial woes, the combined team ultimately succeeded. That codeline still lives today powering IBM’s Informix Universal Server product.

PostgreSQL and Beyond

While we were busy building and selling DataBlades at Illustra, back on the Berkeley campus, Stonebraker students Jolly Chen and Andrew Yu (Happy and Serious, respectively, in Mike’s Turing lecture) decided that they had had enough of PostQuel and did their own SQLization project on the open-source codebase. They released it as Postgres95. This turned out to be the tipping point for open-source Postgres to really take off.8 Even though Postgres was always open source,9 its core was developed exclusively by UC Berkeley students and staff. In April 1996, Jolly Chen sent out a call for volunteers to the public.10 An outsider unrelated to Berkeley, Marc Fournier, stepped up to maintain the Concurrent Versions System (CVS) repository and run the mailing lists. Then a “pickup team of volunteers” magically formed around it on its own, taking Postgres over from the Berkeley students and handing it to a randomly distributed team around the world. Postgres became PostgreSQL and the rest is history.…

Open Source PostgreSQL

PostgreSQL is one of the world’s most widely downloaded and used open-source DBMSs. It is impossible to estimate how many copies are in use since PostgreSQL ships with almost every distribution of Linux. The Linux Counter estimates that over 165,000 machines currently run Linux with over 600,000 users (last accessed March 14, 2018).

The following is from its open-source web page: PostgreSQL Database Server, BlackDuck OpenHub. Last accessed March 7, 2018.

“Project Summary

PostgreSQL is a powerful, open source relational database system. It has more than 15 years of active development and a proven architecture that has earned it a strong reputation for reliability, data integrity, and correctness. It runs on all major operating systems, including Linux, UNIX (AIX, BSD, HP-UX, SGI IRIX, Mac OS X, Solaris, Tru64), and Windows.

In a nutshell, PostgreSQL Database Server …

•  has had 44,391 commits made by 64 contributors representing 936,916 lines of code;

•  is mostly written in C with a well-commented source code;

•  has a well-established, mature codebase maintained by a large development team with stable Y-O-Y commits; and

•  took an estimated 262 years of effort (COCOMO model) starting with its first commit in July, 1996 ending with its most recent commit 4 days ago” http://www.postgresql.org. Last accessed April 12, 2018.
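The “262 years of effort” figure quoted above can be roughly reproduced. Assuming OpenHub applies the basic organic COCOMO formula (effort in person-months = 2.4 × KLOC^1.05), which is our assumption rather than anything stated on the page, 936,916 lines of code works out to roughly 260 person-years:

```python
# Hedged reconstruction of the COCOMO estimate quoted above, assuming
# the basic "organic" model: effort (person-months) = 2.4 * KLOC**1.05.
sloc = 936_916                      # lines of code reported by OpenHub
kloc = sloc / 1000.0
person_months = 2.4 * kloc ** 1.05
person_years = person_months / 12   # roughly 260+ person-years
```

The result lands within a couple of person-years of the quoted 262, which suggests the basic organic model is indeed what the site uses.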

Final Thoughts

Stonebraker spent a decade “struggling to make Postgres real” [Stonebraker 2016]. I have been lucky enough to be a large part of this “struggle.” As Postgres/Illustra developers, we had a general rule of not letting Mike anywhere near the code repositories. However, the codelines throughout were heavily influenced by his vision, ideas, and relentless push to dominate in the “upper-right quadrant” in the market sector. In his Turing lecture, Mike gave a lot of credit to the collection of superstar programmers “on whose shoulders I’ve ridden” [Stonebraker 2016]. On the flip side, so many of these programmers have also ridden on Mike’s shoulders and grown into new generations of leaders in both industry and academia.

1. This is the story behind the Turing Award lecture “The land sharks are on the squawk box.”

2. “Not to mention the lack of any symbolic debugging below the Lisp/C interface, as the C object files were dynamically loaded into the running image of a stripped commercial Lisp binary. Debugging with adb, with only the function names from the stack and raw memory/machine code—good times!”—Paul Aoki, Postgres team member.

3. “I doubt many undergrads were primarily motivated by money since the hourly pay for an undergrad programmer was exactly the same as the people who worked at the dining hall or shelved books at the library. For a long time, the draw was access to better hardware (your own Unix workstation!) and working with cool people on something cool. But except for the hackers on the “six-year plan,” they’d all churn in a year or so .…”—Paul Aoki

4. post_hackers@cs.berkeley.edu was the mailing list for everyone actively developing Postgres at UC Berkeley.

5. “Open source factoid: Every port of the Postgres from Sun/SPARC to something else (Solaris/x86, Linux/x86, HP-UX/PRISM, AIM/power) other than Ultrix/mips and OSF1/Alpha, was first done by external users. This wasn’t just doing an ‘autoconf’; to do so, most had to hand-implement mutex (spinlock) primitives in assembly language, because these were the days before stable thread libraries. They also had to figure out how to do dynamic loading of object files on their OS, since that was needed for the extensibility (function manager, DataBlade) features.”—Paul Aoki

6. See http://www.paulaoki.com/.admin/pgapps.html for more details.

7. A heisenbug is a software bug that seems to disappear or alter its behavior when one attempts to study it.

8. “Postgres has always been open source in the sense that the source code is available but up to this point, the contributors are all UC Berkeley students. The biggest impact that contributed to the longevity of PostgreSQL is transitioning the development to an open source community beyond Berkeley.”—Andrew Yu, Postgres team member

9. “Mike supported its general release as well as releasing it under BSD. Postgres95 would have taken a very different historical path if it was GPL or dual-licensed (a la MySQL).”—Jolly Chen, Postgres team member

10. “I sent out the email in April 1996. The thread is still archived by PostgreSQL historians. Often open-source projects die when initial contributors lose interest and stop working on it.”—Jolly Chen

26

The Aurora/Borealis/ StreamBase Codelines: A Tale of Three Systems

Nesime Tatbul

StreamBase Systems, Inc. (now TIBCO StreamBase) was the first startup company that Michael Stonebraker co-founded after moving from UC Berkeley to MIT in the early 2000s. Like Mike’s other start-ups, StreamBase originated as an academic research project, called Aurora.1 One of the first data stream processing systems built, Aurora was the result of close collaboration among the database research groups of three Boston-area universities: MIT, Brown, and Brandeis. It also marked the beginning of a highly productive, long-lasting period of partnership among these groups that has continued to this day, with several successful joint research projects and startup companies (see Figure 26.1; Chapters 27 and 28).

As Aurora was transferred to the commercial domain, the academic research continued full steam ahead with the Borealis2 distributed stream processing system. Based on a merger of the Aurora codebase (providing core stream processing functionality) with the Medusa3 codebase from MIT’s networking research group (providing distribution functionality), Borealis drove five more years of strong collaboration under Mike’s leadership. Years later, around the same time as StreamBase was being acquired by TIBCO Software, the usual suspects would team up again to build a novel system for transactional stream storage and processing, S-Store.4

Figure 26.1  Michael Stonebraker’s Streaming Systems Timeline.

This chapter provides a collection of stories from members of the Aurora/Borealis and StreamBase teams, who were first-hand witnesses to the research and development behind Stonebraker’s streaming systems and his invaluable contributions.

Aurora/Borealis: The Dawn of Stream Processing Systems

Developing under Debian Linux, XEmacs, CVS, Java, and C++: free

Gas for a round-trip, two-hour commute from New Hampshire to Rhode Island: $15

Funding 4 professors, 14 graduate students, and 4 undergrads for six months: $250,000

Getting a data stream processing system to process 100 gigabytes in six months: priceless!

—Brown University [2002]

This opening quote from a Brown departmental newsletter article about the Aurora project captures in a nutshell how it all got started. Shortly after Mike had arrived on the East Coast, started living in New Hampshire, and working at MIT, he and Stan Zdonik of Brown got in touch to discover their joint research interests. This was way before everything from phones to refrigerators got so smart and connected, but futurists had already been talking about things like pervasive computing, sensor networks, and push-based data. Mike and Stan realized that traditional database systems would not be able to scale to the needs of an emerging class of applications that would require low-latency monitoring capabilities over fast and unpredictable streams of data. In no time, they set up a project team that included two young database professors, Ugur Çetintemel and Mitch Cherniack, as well as several grad students from Brown and Brandeis. I was a second-year Ph.D. student in the Brown Database Group at the time, just finishing off my coursework, in search of a good thesis topic to work on, and not quite realizing yet that I was one lucky grad student who happened to be in the right place at the right time.

The first brainstorming meeting at Brown kicked off a series of others for designing and prototyping a new data management system from scratch and tackling novel research issues along the way. One of the fundamental things to figure out was the data and query model. What was a data stream and how would one express queries over it? Everyone had a slightly different idea. After many hot debates, we finally converged on SQuAl (the Aurora [S]tream [Qu]ery [Al]gebra). SQuAl essentially consisted of streaming analogs of relational operators and several stream-specific constructs (e.g., sliding windows) and operators (e.g., resample), and supported extensibility via user-defined operators. In Aurora terminology, operators were represented with “boxes,” which were connected with “arrows” representing queues of tuples (like database rows), together making up “query networks.” We were ready to start implementing our first continuous queries. Well, almost …
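The boxes-and-arrows model can be sketched in a few lines. This is an illustrative Python analogy using generator-chained operators, not SQuAl’s real operator set or API: each “box” consumes a stream of tuples and emits another, the “arrows” are the chained iterators between boxes, and a sliding window is just bounded state inside a box.

```python
# Hedged sketch of an Aurora-style "query network": boxes are operators
# over tuple streams, arrows are the generators chaining them. Names
# (filter_box, window_avg_box) are illustrative, not SQuAl's.
from collections import deque

def filter_box(stream, predicate):
    """Relational-style selection over a stream."""
    for tup in stream:
        if predicate(tup):
            yield tup

def window_avg_box(stream, size, key):
    """Sliding-window average over the last `size` tuples."""
    window = deque(maxlen=size)
    for tup in stream:
        window.append(tup[key])
        yield sum(window) / len(window)

# A tiny query network: readings -> filter -> sliding-window average
readings = [{"sensor": 1, "temp": t} for t in (10, 50, 20, 30, 40)]
network = window_avg_box(
    filter_box(readings, lambda t: t["temp"] >= 20), size=2, key="temp")
out = list(network)
```

In Aurora itself the network was drawn graphically and the tuples flowed through queues between separately scheduled boxes, but the dataflow shape is the same.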

As systems students, we were all confident coders and highly motivated for the project, but building a whole new database system from the ground up looked way more complex than anything we had ever done before. Where does one even start? Mike knew. The first thing to do was to implement the catalog that would hold the metadata needed by systems components, using BerkeleyDB. We then implemented the data structures for basic primitives such as streams, tuple queues, and windows as well as the boxes for SQuAl operators. The scheduler to execute the boxes was implemented next. These core systems components were implemented in C++. Initially, system settings and workloads to run (i.e., query networks) were simply specified by means of an XML-based, textual configuration file. Later we added a Java-based GUI for making it easier for users to construct query networks and manage their execution using a drag-and-drop boxes-and-arrows interface. The graphical specification would also be converted into the textual one under the covers. Our procedural approach to visually defining dataflows was what set Aurora apart from other systems that had SQL-based, declarative front-ends. Other visual tools were added over time to facilitate system monitoring and demonstration of various advanced features (e.g., Quality of Service [QoS] specification and tracking, monitoring tuple queues and system load).
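The core runtime pieces just described (boxes, tuple queues, and a scheduler) can be caricatured as follows. This is a hedged sketch, simplified far beyond Aurora’s C++ internals: a round-robin scheduler gives each box one unit of work per pass until every queue drains.

```python
# Hedged sketch of boxes connected by tuple queues plus a round-robin
# scheduler, loosely in the spirit of Aurora's runtime (illustrative
# names and structure, not the actual C++ design).
from collections import deque

class Box:
    def __init__(self, fn, in_q, out_q):
        self.fn, self.in_q, self.out_q = fn, in_q, out_q

    def run_once(self):
        """Process one input tuple; return True if any work was done."""
        if not self.in_q:
            return False
        result = self.fn(self.in_q.popleft())
        if result is not None and self.out_q is not None:
            self.out_q.append(result)   # fn returning None drops the tuple
        return True

def schedule(boxes):
    """Round-robin over the boxes until a full pass does no work."""
    busy = True
    while busy:
        busy = any(box.run_once() for box in boxes)

source, mid, sink = deque([1, 2, 3, 4]), deque(), deque()
boxes = [
    Box(lambda x: x * 10, source, mid),               # a "map" box
    Box(lambda x: x if x > 10 else None, mid, sink),  # a "filter" box
]
schedule(boxes)
```

Aurora’s real scheduler was far more sophisticated (batching tuples, choosing which box to run based on QoS), but the queue-between-boxes shape is the same one the catalog and GUI described above were built around.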

Figure 26.2  Aurora research meeting in the Brown Computer Science Library Room, Fall 2002. Left to right, top row: Adam Singer, Alex Rasin, Matt Hatoun, Anurag Maskey, Eddie Galvez, Jeong-Hyon Hwang, and Ying Xing; bottom row: Christina Erwin, Christian Convey, Michael Stonebraker, Robin Yan, Stan Zdonik, Don Carney, and Nesime Tatbul.

During the first year, we were both building the initial system prototype and trying to identify the key research issues. The highest priority was to build a functional system and publish our first design paper. At first, the grad students did not know which research problems each would be working on. Everyone focused on engineering the first working version of the system. This was well worth the effort, since we learned a great deal along the way about what it took to build a real data management system, how to figure out where the key systems problems lay, as well as creating a platform for experimental research. After about six months into the project, I remember a brainstorming session at Brown where major research topics were listed on the board and grad students were asked about their interests. After that day, I focused on load shedding as my research topic, which eventually became the topic of my Ph.D. dissertation.
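As a flavor of the load-shedding problem mentioned above (a generic illustration, not Aurora’s actual QoS-driven algorithm), consider a producer that outruns a slow consumer: once the tuple queue is full, newly arriving tuples are shed so that queueing latency stays bounded.

```python
# Hedged, generic illustration of load shedding (not Aurora's actual
# algorithm): drop tuples that arrive while the queue is at capacity,
# trading answer completeness for bounded latency.
from collections import deque

queue, capacity, dropped = deque(), 10, 0
for tick in range(100):
    # producer: one new tuple per tick
    if len(queue) >= capacity:
        dropped += 1          # shed: discard the tuple under overload
    else:
        queue.append(tick)
    # consumer: only keeps up with every other tuple
    if tick % 2 == 0 and queue:
        queue.popleft()
```

The research problem was deciding which tuples to shed and where in the query network to shed them so that the QoS loss is minimized; this sketch only shows the basic pressure-relief mechanism.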

By the end of year one, our first research paper got into VLDB’02 [Carney et al. 2002]. I remember an enthusiastic email from one of the professors saying something like, “Congratulations folks, we are on the map!” This paper was later selected to appear in a special issue of the VLDB Journal with best papers from that year’s VLDB [Abadi et al. 2003b]. According to a subsequent publication shortly thereafter, Aurora was already an “operational system with 75K lines of C++ code and 30K lines of Java code” by this time [Zdonik et al. 2003]. By the end of the second year, grad students started publishing the first papers on their individual research topics (e.g., load shedding, scheduling), we presented a demo of the system at SIGMOD’03,5 and did our first public code release. At some point, probably shortly before the release, a full-time software engineer was hired to the project to help us manage the growing codebase, organize and clean up the code, create proper documentation, etc.

As Aurora moved to the commercial space, the university team switched attention to its distributed version, Borealis. We explored new research topics such as high availability, fault tolerance, and dynamic load balancing, and published a series of SIGMOD/VLDB/ICDE papers [Ahmad et al. 2005, Balazinska et al. 2004b, 2005, Hwang et al. 2005, Xing et al. 2005]. Within a couple of years, we built a comprehensive system prototype for Borealis, which won the best demo award at SIGMOD’056 (this was the same year as Mike’s IEEE John von Neumann Medal celebration event). Further details on the research work behind Aurora/Borealis can be found in Chapters 17 and 26.

We were not the only team working on streaming research in those days. There were several other leading groups, including the STREAM Team at Stanford (led by Jennifer Widom), the Telegraph Team at UC Berkeley (led by Mike Franklin and Joe Hellerstein), and the Punctuated Streams Team at Oregon Graduate Institute (led by Dave Maier). We interacted with these teams very closely, competing with one another on friendly terms but also getting together at joint events to exchange updates and ideas on the general direction of the field. I remember one such event that was broadly attended by most of the teams: the Stream Winter Meeting (SWiM) held at Stanford in 2003, right after the CIDR’03 Conference in Asilomar, CA.7 Such interactions helped form a larger community around streaming as well as raised the impact of our collective research.

Aurora/Borealis was a major team effort that involved a large number of student developers and researchers with different skills, goals, and levels of involvement in the project across multiple universities. Under the vision and leadership of Mike and the other professors, these projects represent unique examples of how large systems research teams can work together and productively create a whole that is much bigger than the sum of its parts. Mike’s energy and dedication was a great source of inspiration for all of us on the team. He continuously challenged us toward building novel but also practical solutions, never letting us lose sight of the real world.

By the final public release of Borealis in summer 2008, all Aurora/Borealis Ph.D. students had graduated, with seven of them undertaking faculty positions, Mitch and Ugur had been promoted to tenured professors, and StreamBase had already closed its Series C funding and started generating revenue. Mike and Stan? They had long been working on their next big adventure (see Chapters 18 and 27).

From 100K+ Lines of University Code to a Commercial Product

Richard Tibbetts, one of Mike’s first grad students at MIT, witnessed the complete StreamBase period from its inception to the TIBCO acquisition. After finishing his M.Eng. thesis with Mike on developing the Linear Road Stream Data Management Benchmark [Arasu et al. 2004], Richard got involved in the commercialization of Aurora, first as one of the four engineer co-founders, and later on as the CTO of StreamBase. He reminisces here about the early days of the company:

Initially, we called the company ‘DBstream,’ but then the trademark attorneys complained. Then inspired by the pond near Mike’s vacation house in NH where we did an offsite, we wanted to call it ‘Grassy Pond.’ Oops, there was a computer company in Arkansas with that name! Could we pick some other body of water? We were going to change it later anyway … And so became the first official name of the company: ‘Grassy Brook.’ Mike’s daughter [Lesley] created the logo. Later in 2004, Bill Hobbib [VP of marketing] would rename it to ‘StreamBase,’ a stronger name which he and others would build into a very recognizable brand.

We put together very rough pitch decks and presented them to a handful of Boston-area VCs. We had hardly anything in there about commercialization, really much more a solution in search of a problem, but we did have a UI (user interface) we could demonstrate. The visual tool really won people over. It took me years to really be a fan of the visual programming environment. But it was the case that graph-structured applications were a more correct way to look at what we were doing: a natural, “whiteboard” way of representing application logic. Eventually, most of the competitors in the space copied what we were doing, adding visual languages to their SQL-like approaches. Even in tools that didn’t have a visual environment, APIs became graph-topology-oriented. StreamBase invested deeply in visual programming, adding modularity, diff/merge, debugging, and other capabilities in ways that were always visual native. Users, sometimes skeptical at first, fell in love with the visual environment. Years later, after we had sold the company, integrated the tech, and I had left, that visual environment lived on as a major capability and selling point for StreamBase.

For the first four months, from September 2003 to January 2004, we were in the same codebase as academia, making things better and building around it. But at some point, needs diverged and it was appropriate to fork the code. We BSD-licensed a dump of the code and published it at MIT, then downloaded a copy at the company, and went from there with our own code repository.

One of the first things we did was to build a test harness. Regressions and right answers matter a lot to industry—even more than performance and architecture. Most of the academic code wouldn’t survive in the long term. It would get replaced incrementally, ‘Ship of Theseus’ style. Some of the architecture and data structures would remain. We had to make sure we liked them, because they would get harder to change over time. Ph.D. theses typically represented areas of code where complexity exceeded what was strictly necessary, so we started deleting them. Also, anything with three to five distinct implementations of a component was highly suspect, due to the redundant level of flexibility and higher cost to maintain all of those implementations. We either picked the simplest one or the one that was a key differentiator. Deleting code while maintaining functionality was the best way to make the codebase more maintainable and the company more agile.
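
To make the idea concrete, a regression harness of the kind described boils down to running every query and diffing its output against a stored "golden" answer. The sketch below is illustrative only; the names (`check_regressions`, the `run` callable, the golden-answer dictionary) are invented for this example, not StreamBase's actual harness:

```python
# Illustrative regression-harness core: compare each query's current output
# against a recorded "golden" answer and report any drift.
# All names here are hypothetical, not from the StreamBase codebase.

def check_regressions(queries, golden, run):
    """queries: name -> query text; golden: name -> expected output text;
    run: callable that executes a query and returns its output text.
    Returns the names of queries whose output no longer matches."""
    return [name for name, query in queries.items()
            if run(query) != golden[name]]
```

Right answers first, speed second: a harness like this is what lets "Ship of Theseus"-style replacement proceed safely, because every deleted or rewritten component must still reproduce the golden answers.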

Encounters with StreamBase Customers

John Partridge was a co-founder of StreamBase and its initial VP of marketing. Having been on the business development and product marketing side of the company, John shares the following anecdote about Mike’s interaction with StreamBase customers:

We knew that investment banks and hedge funds cared a lot about processing real-time market data with minimal latency and so we targeted all the obvious big-name banks: Goldman Sachs, Morgan Stanley, etc. Those banks were always looking for competitive advantage, and their appetite for new technology was widely known. What was also widely known was that the egos of the managing directors who managed the trading desks were enormous, but these were the people we were trying to meet. Getting four of them in the same room for a one-hour presentation took weeks, sometimes months, to schedule because they all wanted to be viewed as the scarcest, and hence most valuable, person to attend.

Anyhow, at last we had four of the top people from a leading investment bank scheduled to hear Mike give the StreamBase pitch. Mike and I show up for the meeting and three of them are there; no one knows where the fourth guy is. They’re all wearing impeccably tailored suits, power ties, and ludicrously expensive analog watches. I’m wearing one of the two business suits I own and feeling completely outgunned. Mike is wearing his red fleece jacket, polo shirt underneath, and open-toed sandals. There’s no small talk with these guys; they are all business. I do the customary five minutes of introductions, then Mike takes over with his pitch. The technical questions start and, coming from business people, the questions are pretty insightful. They’re also pretty pointed, probably because Mike likes to frame alternative approaches, including this bank’s, as things ‘only an idiot would do.’ So it’s a real grilling they’re giving Mike. Then, 20 minutes into it, the fourth managing director walks in, doesn’t say hello, doesn’t apologize for being late, doesn’t even introduce himself; he just takes a seat and looks at the screen, ignoring the guy in the sandals. Mike, without missing a beat, turns on the late arrival and declaims, ‘You’re late.’ There’s something about the way a tenured professor can say those words that strikes terror in the heart of anyone who has ever graduated college. Not only did the late arrival flinch and reflexively break away from Mike’s withering stare, but you could see the other three managing directors grinning through a wave of shared schadenfreude. At that moment, what had been a tough business meeting became a college lecture where the professor explained how the world worked and everyone else just listened.

“Over My Dead Body” Issues in StreamBase

Richard Tibbetts reminds us of the infamous phrase from Mike that whoever worked with him has probably heard him say at least once:

Mike was fond of taking very strongly held positions, based on experience, and challenging people to overcome them. Occasionally these would elicit the familiar phrase ‘Over My Dead Body (OMDB).’ However, it turned out that Mike could be swayed by continued pressure and customer applications.

John Partridge remembers one such OMDB issue from the early days of Stream-Base:

The StreamBase stream processing engine was originally implemented as an interpreter. This was the standard way to build database query executors at the time, and it made it easier to add new operators and swap in alternative implementations of operators. I think we were about eight or nine months in when the lead engineers on the engine, Jon Salz and Richard Tibbetts, began to realize that the performance hit for running the interpreter was disastrous. The more they looked at the problem, the more they believed that using a just-in-time Java compiler would run much faster and still provide a mechanism for swapping out chunks of application logic on the fly. Mike would have none of this and it became an ‘Over My Dead Body’ issue for months. Finally, our CEO resolved the debate by giving Jon Salz and Richard Tibbetts two months to get an experimental version up and running. Mike viewed this as a complete waste of precious developer time, but agreed just to keep the peace. Jon and Richard finished it a week early and the performance advantages were overwhelming. Mike was hugely impressed with the results and, of course, with Jon and Richard. We switched over to the Java JIT compiler and never looked back.

Richard Tibbetts adds the following details from the same time period, and how taking up the challenge raised by Mike led the engineering team to a new and improved version of the StreamBase engine to be shipped to the customers:

Professors spend a lot of time challenging graduate students to do impossible things, expecting them to be successful only a fraction of the time, and this yields a lot of scientific advancement. At StreamBase, the Architecture Committee was where Mike, some of the other professors, the engineering founders, and some of the other senior engineers came together to discuss what was being built and how it was being built. These meetings regularly yielded challenges to Engineering to build something impossible or to prove something.

The first instance I recall came very early in the life of the company, when Jon Salz wanted to completely replace the user interface for editing graphical queries that had been developed in the university. Mike and Hari [Balakrishnan] thought it would be lower risk to begin modifying it and incrementally improve it over time. Jon asserted he could build a better UI that would be easier to improve going forward, and they challenged him to do it in two weeks. He delivered, and that became the basis for our commercial GUI, and also our move into Eclipse-based development environments, which enabled a vast array of capabilities in later versions. It was also the first Java code in the company.

Another instance, possibly even more impactful, came as it became clear that the original architecture wasn’t a fit for what the market needed. StreamBase 1.0, like Aurora, managed queries as collections of processing nodes and queues, with a scheduler optimizing multi-threaded execution by dispatching work based on queue sizes and available resources. The processing nodes were implemented in C++, making this system a sort of interpreter for queries.
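
The queue-plus-scheduler design can be sketched in a few lines. This is a toy illustration of the general Aurora-style architecture described above, not actual StreamBase code; the class and function names are invented, and the scheduling policy shown (run the most backlogged operator) is just one plausible choice:

```python
import collections

# Toy sketch of a queue-and-scheduler stream engine in the style described:
# each operator has an input queue, and a scheduler repeatedly dispatches
# the operator with the longest queue. Names are illustrative only.

class Operator:
    def __init__(self, fn, downstream=None):
        self.fn = fn                      # per-tuple processing function; None drops the tuple
        self.queue = collections.deque()  # this operator's input queue
        self.downstream = downstream      # next operator, or None for a sink

def schedule(operators, sink):
    """Run until all queues drain, always picking the most-backlogged operator."""
    while any(op.queue for op in operators):
        op = max(operators, key=lambda o: len(o.queue))
        out = op.fn(op.queue.popleft())
        if out is not None:
            if op.downstream is not None:
                op.downstream.queue.append(out)
            else:
                sink.append(out)
```

Note that every tuple pays for a queue append, a scheduling decision, and a queue pop on top of the actual operator work.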

It turned out that execution performance was dominated by the cost of queuing and scheduling. Jon Salz proposed an alternative approach, codenamed “SB2” for being the second version of StreamBase. In Jon’s proposal, queries would be compiled to Java byte-code and executed by the system without requiring any queuing or scheduling. This would also dramatically change the execution semantics of the graphical queries, making them much easier to reason about, since data would flow in the same path every time.
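
By contrast, the compiled approach wires the operators into a single fixed call path. The sketch below uses Python closures to stand in for the Java bytecode generation described above; it is illustrative only, and `compile_pipeline` is an invented name:

```python
# Sketch of the compiled idea: instead of queues plus a scheduler, fuse the
# operator chain into one function so every tuple flows down the same path.
# (The real system emitted Java bytecode; Python closures stand in here.)

def compile_pipeline(stages):
    """stages: list of per-tuple functions; a None result drops the tuple."""
    def run(tuples):
        out = []
        for t in tuples:
            for stage in stages:
                t = stage(t)
                if t is None:
                    break          # tuple filtered out mid-pipeline
            else:
                out.append(t)      # survived every stage
        return out
    return run
```

With no queues and no scheduler, the per-tuple cost is a plain function call per operator, and every tuple takes the same path every time, which is what made the execution semantics easier to reason about.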

Of course, it would also make the system completely different from the academic approach to high performance streaming. And Mike was appalled at the idea of building database tech in Java, which he believed was slow. Jon was confident that he could make it faster, much faster. So, Mike challenged him to prove it. In a couple of months Jon and I had a prototype, which delivered 3–10 times higher throughput than the existing implementation, and even more dramatically lower latency. Mike happily conceded the point, and StreamBase 3.0 shipped with the new faster engine, with a nearly seamless customer upgrade.

Richard Tibbetts recalls two additional OMDB situations, where customer requirements and hard work of engineers overcame oppositions from Mike: adding looping and nested data support to StreamBase.

Within a processing graph, it made computational sense to have cycles: loops where messages might feed back on one another. However, this was at odds with the SQL model of processing, and made our system less declarative, as well as admitting the possibility of queries that would ‘spin loop,’ consuming lots of CPU and never terminating. There were discussions of ways to enable some of this while maintaining declarative semantics, but in the end, customer use cases (for example, handling a partially filled order by sending the unfilled part back to match again) demanded loops, and they became a core part of the system.
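
The order-matching example can be sketched as a feedback loop with an explicit iteration cap, one simple way to keep the "spin loop" hazard in check. All names here are hypothetical, not from the StreamBase product:

```python
# Illustrative feedback loop: a partially filled order is fed back to the
# matcher until it is fully filled, with an iteration cap guarding against
# a non-terminating "spin loop."

def match_until_filled(order_qty, fills, max_rounds=100):
    """fills: iterator yielding the quantity the matcher fills each round.
    Returns the remaining unfilled quantity (0 if fully filled)."""
    remaining = order_qty
    for _ in range(max_rounds):      # cap bounds CPU spent on the cycle
        if remaining <= 0:
            return 0
        remaining -= next(fills, 0)  # feed the unfilled part back in
    return remaining
```

In a more declarative setting the guard might instead be a resource governor or loop-depth limit; either way, once cycles are admitted, some termination story is required.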

For nested data, the relational database model said it was a bad thing. That XML had been a horrible idea as a data representation, just a retread of CODASYL (the Conference/Committee on Data Systems Languages) from the 1970s. This may be true for data at rest. But data in motion is more like self-contained messages. Many of the other systems and protocols StreamBase had to integrate with had nested structures in their messages. Over time, StreamBase handled more and more of these message structures, and at no point was Mike’s health adversely affected …

An April Fool’s Day Joke, or the Next Big Idea?

Both Eddie Galvez of TIBCO StreamBase (a former grad student from the Aurora Brandeis team and one of the four engineer co-founders of StreamBase) and Richard Tibbetts fondly remember how the StreamBase engineers tried to fool Mike several times with their April Fool’s Day jokes about some technical topic around StreamBase. Richard retells one story that I have found especially interesting:

StreamBase Engineering had a tradition of announcing major new product capabilities to the company, usually with an enthusiastic mass email, on April 1st. One year, Jon announced that he had rewritten the whole system in Haskell and it was another ten times faster. On more than one occasion, Mike responded to these in good faith. One occurrence in particular comes to mind:

Very early versions of StreamBase had embedded a MySQL query engine to manage stored data. This software never shipped, and had many technical flaws. It also offended Mike, because MySQL was just not a very good database. However, it was designed to be embeddable (today SQLite would be the logical choice), and so we tried it out. Eventually, we switched to our own, more limited stored data management, in memory and on disk using Sleepycat. But MySQL did come up again.

On Sunday, April 1, 2007, Hayden Schultz and I had worked through the weekend integrating with a MySQL-based system at Linden Lab, a prospective customer. I sent an email to the whole company, announcing that in addition to succeeding at the customer integration, we had had an epiphany about MySQL, enumerating its many wonderful capabilities. The email concluded, ‘The sum of all these benefits is that MySQL is the obvious choice for all StreamBase persistence. We should begin exploring a MySQL-based architecture for StreamBase 5.0, and also look at how we can bring this technology to bear in existing engagements.’

Mike quickly responded saying that he had a superior proposal, based on a planned research system called ‘Horizontica,’ and we should table any decision until the next Architecture Committee meeting. I laughed out loud at the pun on Vertica. Mike had clearly gotten in on the April 1st activities. But then I opened the attachment, which was a 13-page preprint of a VLDB paper. In fact, this was an actually interesting alternative approach for StreamBase persistence, and some pretty cool research. That system would later become H-Store/VoltDB.

As an interesting note to add to Richard’s story, exactly six years later on Monday, April 1, 2013, I joined the Intel Science and Technology Center for Big Data based at MIT as a senior research scientist to work with Mike and the old gang again, on a new research project that we named “S-Store.” The idea behind S-Store was to extend the H-Store in-memory OLTP database engine with stream processing capabilities, creating a single, scalable system for processing stored and streaming data with transactional guarantees [Meehan et al. 2015b]. S-Store was publicly released in 2017.

Next time you make an April Fool’s Day joke to Mike, think twice!

Concluding Remarks

Stream processing has matured into an industrial-strength technology over the past two decades. Current trends and predictions in arenas such as the Internet of Things, real-time data ingestion, and data-driven decision-making indicate that the importance of this field will only continue to grow in the future. Stonebraker’s streaming systems have been immensely influential in defining the field and setting its direction early on, all the way from university research to the software market. These systems have also been great examples of productive collaboration and teamwork at its best, not to mention the fact that they shaped many people’s careers and lives. The codeline stories retold in this chapter provide only a glimpse of this exciting era of Mike’s pioneering contributions.

Figure 26.3  The Aurora/Borealis/StreamBase reunion on April 12, 2014 at MIT Stata Center for Mike’s 70th Birthday Celebration (Festschrift). From left to right, front row: Barry Morris, Nesime Tatbul, Magda Balazinska, Stan Zdonik, Mitch Cherniack, Uğur Çetintemel; back row: John Partridge, Richard Tibbetts, and Mike Stonebraker. (Photo courtesy of Jacek Ambroziak and Sam Madden.)

Acknowledgments

Thanks to Eddie Galvez, Bobbi Heath, and Stan Zdonik for their helpful feedback.

1. http://cs.brown.edu/research/aurora/. Last accessed May 14, 2018.

2. http://cs.brown.edu/research/borealis/. Last accessed May 14, 2018.

3. http://nms.csail.mit.edu/projects/medusa/. Last accessed May 14, 2018.

4. http://sstore.cs.brown.edu/. Last accessed May 14, 2018.

5. Photos from the Aurora SIGMOD’03 demo are available at: http://cs.brown.edu/research/aurora/Sigmod2003.html. Last accessed May 14, 2018.

6. Photos from the Borealis SIGMOD’05 demo are available at: http://cs.brown.edu/research/db/photos/BorealisDemo/index.html. Last accessed May 14, 2018.

7. See the “Stream Dream Team” page maintained at http://infolab.stanford.edu/sdt/. (Last accessed May 14, 2018) and detailed meeting notes from the SWiM 2003 Meeting at http://telegraph.cs.berkeley.edu/swim/. (Last accessed May 14, 2018).

27

The Vertica Codeline

Shilpa Lawande

The Vertica Analytic Database unequivocally established column-stores as the superior architecture for large-scale analytical workloads. Vertica’s journey started as a research project called C-Store, a collaboration by professors at MIT, Brown, Brandeis, and UMass Boston. When Michael Stonebraker and his business partner Andy Palmer decided to commercialize it in 2005, C-Store existed in the form of a research paper that had been sent for publication to VLDB (but not yet accepted) and a C++ program that ran exactly seven simple queries from TPC-H out of the box—it had no SQL front-end or query optimizer, and in order to run additional queries, you had to code the query plan in C++ using low-level operators! Six years later (2011), Vertica was acquired by Hewlett-Packard Enterprise (HPE). The Vertica Analytics Engine—its code and the engineers behind it—became the foundation of HPE’s “big data” analytics solution.

What follows are some highlights from the amazing Vertica journey, as retold by members of its early engineering team. And some lessons we learned along the way.

Building a Database System from Scratch

My involvement with Vertica started in March 2005 when I came across a job ad on Monster.com that said Stonebraker Systems: “Building some interesting technology for data warehousing.” As someone who was getting bored at Oracle and had studied Mike’s Red Book1 during my DB classes at University of Wisconsin-Madison, I was intrigued, for sure. My homework after the first interview was—you guessed it—read the C-Store paper [Stonebraker et al. 2005a] and be ready to discuss it with Mike (a practice we continued to follow, except eventually the paper was replaced with the C-Store Seven Years Later paper [Lamb et al. 2012], and the interview conducted by one or more senior developers). I do not recall much of that first interview but came away inspired by Mike’s pitch: “It doesn’t matter whether we succeed or fail. You would have built an interesting system. How many people in the world get to build a database system from scratch?” And that’s why I joined Vertica (see Chapter 18).

The early days were filled with the usual chaos that is the stuff of startups: hard stuff like getting the team to jell, easier stuff like writing code, more hard stuff like sorting through disagreements on whether to use push- or pull-based data-flow operators (and whether the building was too hot for the guys or too cold for me), writing some more code, and so on.

In the summer of 2005, we hired Chuck Bear, who at the time was living out of his last company’s basement and working his way down the Appalachian Trail. After Chuck’s interview, Mike barged into the engineering meeting saying, “We must do whatever it takes to hire this guy!” And since the team was fully staffed, Chuck got asked to do “performance testing.” It did not take long for everyone to realize that Chuck’s talents were underutilized as a “tester” (as Mike called quality assurance engineers). There was one occasion where Chuck couldn’t convince one of the engineers that we could be way faster than C-Store, so, over the next few nights, while his tests were running, he wrote a bunch of code that ran 2× faster than what was checked in!

The first commercial version of Vertica was already several times faster than C-Store, and we were only just getting going, a fantastic feat of engineering! From here on, C-Store and Vertica evolved along separate paths. Vertica went on to build a full-fledged petabyte-scale distributed database system, but we did keep in close touch with the research team, sharing ideas, especially on query execution with Daniel Abadi and Sam Madden, on query optimization with Mitch Cherniack at Brandeis, and on automatic database design with Stan Zdonik and Alex Rasin at Brown. Vertica had to evolve many of the ideas in the C-Store paper from real-world experience, but the ideas in Daniel Abadi’s Ph.D. thesis on compressed column stores still remained at the heart of Vertica’s engine, and we should all be glad he chose computer science over medicine.
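
A flavor of that thesis work: run-length encoding (RLE), one of the column compressions at the heart of the engine, collapses a sorted column into (value, count) runs. A minimal sketch, illustrative rather than Vertica's actual implementation:

```python
# Minimal run-length encoding sketch: a sorted column such as
# ('MA', 'MA', 'MA', 'NY') is stored as [('MA', 3), ('NY', 1)].

def rle_encode(column):
    runs = []
    for v in column:
        if runs and runs[-1][0] == v:
            runs[-1] = (v, runs[-1][1] + 1)   # extend the current run
        else:
            runs.append((v, 1))               # start a new run
    return runs

def rle_decode(runs):
    return [v for v, n in runs for _ in range(n)]
```

On a well-sorted column a single run can cover an enormous number of rows, which is where the dramatic space and scan savings of a compressed column store come from.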

Lesson. In effective software engineering organizations, the best ideas win. Shared ownership of the code base is essential. And, if you can’t resolve a disagreement with words, do it with code.

Code Meets Customers

The codeline journey of Vertica was a good example of what is called a “Lean Startup” these days—again Mike was ahead of his time (see Chapter 7). The first version, “Alpha,” was supposed to do only the seven C-Store queries, but with an SQL front-end instead of C++, and to run on a single node. To do this, the decision was to use a “brutalized Postgres” (see Chapter 16), throwing away everything except its parser and associated data structures (why reinvent the wheel?) and converting it from a multi-process model to a single-process multi-threaded model. Also left out by choice: a lot of things that you can’t imagine a database not being able to do!

Omer Trajman was one of the early engineers. He later went on to run the Field Engineering team (charged with helping deploy Vertica in customer sites). He recalls:

One of these choices was pushing off the implementation of delete, a crazy limitation for a new high-performance database. In the first commercial versions of Vertica, if a user made a mistake loading data, the data couldn’t be changed, updated, or even deleted. The only command available to discard data was to drop the database and start over. As a workaround to having to reload data from flat files, the team later added INSERT/SELECT in order to create a copy of loaded data with some transformation applied, including removing rows. After adding the ability to rename and drop tables, the basic building blocks to automate deletes were in place. As it turns out, this was the right decision for Vertica’s target market.
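
The workaround pattern described (copy the survivors, drop the original, rename the copy into place) can be sketched against SQLite as a stand-in engine; `delete_where_not` is an invented helper name, and Vertica's actual mechanics differed:

```python
import sqlite3

# Illustrative "delete without DELETE": keep only the rows matching a
# predicate by copying them with INSERT/SELECT-style DDL, dropping the
# old table, and renaming the copy into place.

def delete_where_not(conn, table, keep_predicate_sql):
    conn.execute(f"CREATE TABLE {table}_new AS SELECT * FROM {table} "
                 f"WHERE {keep_predicate_sql}")   # copy the survivors
    conn.execute(f"DROP TABLE {table}")           # discard the original
    conn.execute(f"ALTER TABLE {table}_new RENAME TO {table}")
```

This is effectively DELETE rebuilt from copy, drop, and rename, the same basic building blocks the team added.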

The Vertica team found that there were two types of ideal early customers: those whose data almost never changed, and those whose data changed all the time. For people with relatively static data, Vertica provided the fastest and most efficient response times for analytics. For people whose data changed all the time, Vertica was able to go from raw data to fast queries more quickly than any other solution in the market. To get significant value from Vertica, neither customer type needed to delete data beyond dropping tables. Customers with data that rarely changed were able to prepare it and make sure it was properly loaded. Customers with rapidly changing data did not have the time to make corrections. Mike and the team had a genuine insight that at the time seemed ludicrous: a commercial database that can’t delete data.

Lesson. Work with customers, early and often. Listen carefully. Don’t be constrained by conventional wisdom.

Don’t Reinvent the Wheel (Make It Better)

Discussions about what to build and what not to build weren’t without their share of haggling between the professors who wrote the academic C-Store paper [Stonebraker et al. 2005a] and the engineers who were building the real-world Vertica. Here’s Chuck Bear recounting those days.

Back in 2006, the professors used to drop by Vertica every week to make sure we (the engineers) were using good designs and otherwise building the system correctly. When we told Mike and Dave DeWitt2 that we were mulling approaches to multiple users and transactions, maybe some sort of optimistic concurrency control or multi-versioning, they yelled at us and said, in so many words, “Just do locking! You don’t understand locking! We’ll get you a copy of our textbook chapter on locking!” Also, they told us to look into the Shore storage manager [Carey et al. 1994], thinking maybe we could reuse its locking implementation.

We read the photocopy of the chapter on locking that they provided us, and the following week we were prepared. First, we thanked the professors for their suggested reading material. But then we hit them with the hard questions … “How does locking work in a system like Vertica where writers don’t write to the place where readers read? If you have a highly compressed table, won’t a page-level lock on an RLE3 column essentially lock the whole table?”

In the end, they accepted our compromise idea, that we’d “just do locking” for transaction support, but at the table level, and additionally readers could take snapshots so they didn’t need any locks at all. The professors agreed that it was a reasonable design for the early versions, and in fact it remains this way over ten years later.
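
The compromise can be sketched in a few lines (a toy model in Python, not Vertica's actual implementation): writers serialize on a single table-level lock, while readers grab an immutable snapshot and never take locks at all.

```python
import threading

class Table:
    """Toy model of the locking compromise: one lock per table for
    writers, lock-free snapshot reads for readers."""

    def __init__(self):
        self._lock = threading.Lock()   # table-level lock; nothing finer-grained
        self._rows = ()                 # immutable tuple = current committed version

    def write(self, new_rows):
        # Writers serialize on the table lock.
        with self._lock:
            self._rows = self._rows + tuple(new_rows)

    def snapshot(self):
        # Readers just grab a reference to the current immutable version;
        # they see a consistent state without blocking writers at all.
        return self._rows

t = Table()
t.write([("2006-01-01", 42)])
snap = t.snapshot()                     # the reader's view is frozen here
t.write([("2006-01-02", 43)])           # later writes don't disturb `snap`
```

Because each committed version is immutable, a long-running analytic query can keep reading its snapshot while loads continue, which is the property the engineers were after.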

That’s the way lots of things worked. If you could get a design that was both professor-approved and that the engineers figured they could build, you had a winner.

Lesson. This decision is a great case study for “Keep it simple, stupid,” (aka KISS principle) and “Build for the common case,” two crucial systems design principles that are perhaps taught in graduate school but can only be cemented through the school of hard knocks.

Architectural Decisions: Where Research Meets Real Life

The decision about locking was an example of something we learned over and over during Vertica’s early years: that “professors aren’t always right” and “the customer always wins.”

The 2012 paper “The Vertica Analytic Database: C-Store 7 years later” [Lamb et al. 2012] provides a comprehensive retrospective on the academic proposals from the original C-Store paper that survived the test of real-world deployments—and others that turned out to be spectacularly wrong.

For instance, the idea of permutations4 was a complete disaster. It slowed the system down to the point of being useless and was abandoned very early on. Late materialization of columns worked to an extent, for predicates and simple joins, but did not do so well once more complex joins were introduced. The original assumption that most data warehouse schemas [Kimball and Ross 2013] were “Star” or “Snowflake” served the system well in getting some early customers but soon had to be revisited. The optimizer was later adapted for “almost star” or “inverted snowflake” schemas and then was ultimately completely rewritten to be a general distributed query optimizer. Eventually, Vertica’s optimizer and execution engine did some very clever tricks, including leveraging information on data segmentation during query optimization (vs. building a single node plan first and then parallelizing it, as most commercial optimizers tend to do); delaying optimizer decisions like type of join algorithm until runtime; and so on.
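
One of those runtime tricks, deferring the join-algorithm choice until input sizes are known, can be sketched roughly as follows (the threshold and the two algorithms are illustrative and assume unique, pre-sorted join keys for the merge path; this is not Vertica's actual planner):

```python
def join(left, right, key):
    """Pick a join algorithm at runtime based on the actual input sizes,
    rather than committing to one at plan time."""
    if len(right) <= 1000:
        # Small inner side: build a hash table and probe it.
        index = {}
        for row in right:
            index.setdefault(row[key], []).append(row)
        return [(l, r) for l in left for r in index.get(l[key], [])]
    # Both sides large and already sorted on the key (as columnar
    # projections often are): merge join, assuming unique keys.
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i][key] < right[j][key]:
            i += 1
        elif left[i][key] > right[j][key]:
            j += 1
        else:
            out.append((left[i], right[j]))
            i += 1
            j += 1
    return out
```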

Another architectural decision that took several iterations and field experience to get right was the design of the Tuple Mover. Here’s Dmitry Bochkov, the early lead engineer for this component, reminiscing about his interactions with Mike during this time.

The evolution of the Tuple Mover design in the first versions of Vertica demonstrated to me Mike’s ability to support switching from academic approach to “small matters of engineering” and back. What started as a simple implementation of an LSM (log-structured merge-tree) quickly degenerated into a complicated, low-performance component plagued by inefficient multiple rewrites of the same data and a locking system that competed with the Execution Engine and Storage Access Layer locking mechanisms.

It took a few rounds of design sessions that looked more like a thesis defense, and I will forever remember the first approving nod I received from Mike. What followed was that the moveout and mergeout algorithms ended up using “our own dog food.” Our own Execution Engine was used for running the Tuple Mover operations to better handle transactions, resource planning, failover, and reconciliation among other tasks. And while it added significant pressure on other components, it allowed the Tuple Mover to become an integral part of Dr. Stonebraker’s vision of a high-performance distributed database.

Anyone who has worked with Mike knows he is a man of few words, and if you listen carefully, you can learn a massive amount from his terseness. If you worked at Vertica in the early days, you often heard Mike-isms, such as “buying a downstream farm” (referring to “engineering debt”)5 and the famous “over Mike’s dead body” (OMDB). These phrases referred to all the “bells and whistles” that database systems are filled with that Vertica would never build, perfectly capturing the tension between “research” and “real-life” choices that Vertica faced repeatedly over its life.

Min Xiao,6 a founding engineer turned sales engineer, describes an OMDB encounter with Mike.

One day in 2008, I came back to the office after visiting a global bank customer. I saw that Mike, wearing a red shirt, sat in a small corner conference room working on his laptop. I stepped in and told him that the bank needed the feature of disaster recovery (DR) from Vertica. In the past, Mike had always wanted me to let him know the product requests from the customers. For this customer, their primary Vertica instance was in Manhattan and they wanted a DR instance in New Jersey. They had used Oracle for the same project prior to Vertica and therefore also hoped to have a statement-by-statement-via-change-data-capture type of DR. Mike listened to me for a minute. Apparently, he had heard the request from someone else and didn’t look surprised at all. He looked at me and calmly said “They don’t need that type of DR solution. All they need is active replication through parallel loading.” As always, the answer was concise as well as precise. While I took a moment to digest his answer, he noticed my hesitation and added “over my dead body.” I went back to the customer and presented them with the proposal of maintaining a replicated copy. The bank wasn’t overly excited but didn’t raise the DR request anymore. Meanwhile, one of our largest (non-bank) customers, who had never used Oracle, implemented exactly what Mike had proposed and was very happy with it. They loaded into two 115-node clusters in parallel and used them to recover from each other.

Lesson. Complexity is often the Achilles’ heel of large-scale distributed systems, and as Daniel Abadi describes in vivid detail in Chapter 18, Mike hated complexity. With the liberally used phrase, OMDB, Mike forced us to think hard about every feature we added, to ensure it was truly required, a practice that served us well as our customer base grew. One of the reasons for Vertica’s success was that we thought very hard about what NOT to add, even though there was a ton of pressure from customers. Sometimes we had to relent on some earlier decisions as the system evolved to serve different classes of customers, but we still always thought long and hard about taking on complexity.

Customers: The Most Important Members of the Dev Team

Just as we thought hard about what features to add, we also listened very carefully to what customers were really asking for. Sometimes customers would ask for a feature, but we would dig into what problem they faced instead and often find that several seemingly different requests could be fulfilled with one “feature.” Tight collaboration between engineering and customers became a key aspect of our culture from early on. Engineers thrived from hearing about the problems customers were having. Engineering, Customer Support, and Field Engineers all worked closely together to determine solutions to customer problems, and the feedback often led to improvements, some incremental, but sometimes monumental.

The earliest example of such a collaboration was when one of the largest algorithm trading firms became a customer in 2008. Min Xiao recalls a day trip by the founders of this trading firm to our office in Billerica, Massachusetts, one Thursday afternoon.

Their CTO was a big fan of Mike. After several hours of intense discussions with us, we politely asked if they needed transportation to the airport. (This was before the days of Uber.) Their CEO casually brushed aside our offer. Only later did we find out that they had no real schedule constraints because they had flown in their own corporate jet. Not only that, but once he found out that Mike played the banjo, the next day he brought his bass guitar to the Vertica office. Mike, Stan Zdonik (a professor at Brown University), and John “JR” Robinson (a founding engineer of Vertica) played bluegrass together for several hours. This wasn’t an isolated “Mike fan”: customers loved and respected Mike for his technical knowledge and straight talk. We often joked that he was our best salesperson ever. :-)

Over time, this customer became a very close development partner to Vertica. They voluntarily helped us build Time-series Window functions, a feature-set that was originally on the “OMDB” list. Due to Vertica’s compressed and sorted columnar data storage, many of the windowing functions, which often take a long time to execute in other databases, could run blazingly fast in Vertica.

I recall the thrill that engineers felt to see the fruits of their work in practice.

It was a day of great celebration for the engineering team when this customer reached a milestone running sub-second queries on 10 trillion rows of historical trading data! These time-series functions later became one of the major performance differentiators for Vertica, and enabled very sophisticated log analytics to be expressed using rather simple SQL commands.

A big technical inflection point for Vertica came around 2009, when we started to land customers in the Web and social gaming areas. These companies really pushed Vertica’s scale to being able to handle petabytes of data in production. It took many iterations to really get “trickle loads” to work, but in the end one of these customers had an architecture where every click from all their games went into the database, and yet they were able to update analytical models in “near real-time.”

Another inflection point came when a very high profile social media customer decided to run Vertica on 300 nodes of very cheap and unreliable hardware. Imagine our shock when we got the first support case on a cluster of this size! This customer forced the team to really think about high availability and the idea that nodes could be down any time. As a result, the entire system—from the catalog to recovery to cluster expansion—had to be reviewed for this use case. By this time, more and more customers wanted to run on the cloud, and all this work proved invaluable to support that use case.

Lesson. Keep engineers close to customers. Maybe make some music together. Listen carefully to their problems. Collaborate with them on solutions. Don’t be afraid to iterate. There is no greater motivator for an engineer than to find out his or her code didn’t work in the real world, nor greater reward than seeing their code make a difference to a customer’s business!

Conclusion

Vertica’s story is one of a lot of bold bets, some of which worked right from academic concept, and others that took a lot of hard engineering to get right. It is also a story of fruitful collaboration between professors and engineers. Most of all, it is a story of how a small startup, by working closely with customers, can change the prevailing standard of an industry, as Vertica did to the practices of data warehousing and big data analytics.

Acknowledgments

Thank you to Chuck Bear, Dmitry Bochkov, Omer Trajman, and Min Xiao of the early Vertica Engineering team for sharing their stories for this chapter.

1. Readings in Database Systems http://www.redbook.io/.

2. Dave DeWitt (see Chapter 6), on Vertica’s technical advisory board, often visited the Vertica team.

3. Run Length Encoding

4. The idea that multiple projections in different sort orders could be combined at runtime to recreate the full table. Eventually, it was replaced by the notion of a super projection that contains all the columns.

5. A farm downstream along a river will always be flooded and may appear to be cheaper. This is an analogy for engineering debt, decisions made to save short-term coding work that required a ton of time and effort (i.e., cost) in the long run.

6. Min Xiao followed Mike and Andy Palmer to join the founding team of Tamr, Inc.

28

The VoltDB Codeline

John Hugg

I was hired by Mike Stonebraker to commercialize the H-Store1 research [Stonebraker et al. 2007b] in early 2008. For the first year, I collaborated with academic researchers building the prototype, with close oversight from Mike Stonebraker.2 Andy Pavlo and I presented our early results at VLDB 2008 [Kallman et al. 2008] in August of that year. I then helped lead the efforts to commercialize VoltDB, ultimately spending the next ten years developing VoltDB with a team I was privileged to work with. In my time at VoltDB, Inc., Mike Stonebraker served as our CTO and then advisor, offering wisdom and direction for the team.

VoltDB was conceived after the success of Vertica3: if Vertica, a system dedicated to analytical data, could beat a general-purpose system by an order of magnitude at analytical workloads, could a system dedicated to operational data do the same for operational workloads? This was the next step in Mike Stonebraker’s crusade against the one-size-fits-all database.

VoltDB was to be a shared-nothing, distributed OLTP database. Rethinking assumptions about traditional systems, VoltDB threw out shared-memory concurrency, buffer pools and traditional disk persistence, and client-side transaction control. It assumed that high-volume OLTP workloads were mostly horizontally partitionable, and that analytics would migrate to special-purpose systems, keeping queries short.
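
The horizontal-partitioning assumption can be made concrete with a small sketch (the hash function and partition count are illustrative, not VoltDB's actual routing):

```python
import zlib

N_PARTITIONS = 8   # illustrative; real clusters size this to cores x nodes

def partition_for(key: str) -> int:
    # A stable hash (CRC32 here) so every node routes a given
    # partitioning key to the same partition.
    return zlib.crc32(key.encode()) % N_PARTITIONS

# A single-partition transaction touches only rows sharing its
# partitioning key, so it can run to completion on that partition's
# single thread with no shared-memory concurrency at all.
p = partition_for("customer-42")
```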

The proposed system would dramatically reduce scaling issues, support native replication and high availability, and reduce costs for operational workloads without sacrificing transactions and strong consistency.

VoltDB 1.0 was originally released in April 2010, after nearly two years of internal development. Work on the H-Store academic project continued in parallel. Over the years, many ideas and experimental results were shared between the researchers and the VoltDB engineering team, but code diverged as the two systems had different purposes. VoltDB also hired a number of graduate students who worked on the H-Store project.

Compaction4

In the Fall of 2010, the very first customer, who was equal parts brave and foolish, was using VoltDB 1.x in production and was running into challenges with memory usage.

This customer was using the resident set size (RSS) for the VoltDB process as reported by the OS as the key metric. While memory usage monitoring is more complex than disk usage monitoring, this is a good metric to use in most cases.

The problem was that the RSS was increasing with use, even though the data was not growing. Yes, records were being updated, deleted, and added, but the total number of records and the size of the logical data they represented was not growing. However, eventually, VoltDB would use all of the memory on the machine. This early customer was forced to restart VoltDB on a periodic basis—not great for a system designed for uptime. Needless to say, this was unacceptable for an in-memory database focused on operational workloads.

The problem was quickly identified as allocator fragmentation. Under it all, VoltDB was using GNU LibC malloc, which allocated big slabs of virtual address space and doled out smaller chunks on request. Allocator fragmentation happens when a slab is logically only half used, but the “holes” that can be used to service new allocations are too small to be useful.
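
The effect is easy to demonstrate with a toy model (invented sizes, nothing like GLibC malloc's real internals): a slab can be half free yet unable to serve a modest request, because no individual hole is big enough.

```python
# A 64-byte slab carved into eight 8-byte chunks; free every other
# chunk so the slab is half empty, then ask for a 16-byte (2-chunk) run.
slab = [f"obj{i}" for i in range(8)]   # fully allocated
for i in range(0, 8, 2):
    slab[i] = None                     # free alternating chunks

used = sum(1 for c in slab if c is not None)

def can_fit(n_chunks):
    """Is there a run of n_chunks contiguous free slots?"""
    run = 0
    for c in slab:
        run = run + 1 if c is None else 0
        if run >= n_chunks:
            return True
    return False
```

Half the slab is free, a 1-chunk request fits, but a 2-chunk request does not: that wasted half is exactly why RSS kept growing even though the logical data did not.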

There are two main ways to deal with this problem. The most common approach is to use a custom allocator. The two most common alternatives are JEMalloc and TCMalloc. Both are substantially more sophisticated at avoiding fragmentation waste than the default GLibC malloc.

The VoltDB team tried these options first but ran into challenges because VoltDB mixed C++ and Java in the same process. Using these allocators with the in-process JVM was challenging at the time.

The second approach, which is both more challenging and more effective, is to do all the allocation yourself. You don’t actually have to manage 100% of allocations. Short-lived allocations and permanent allocations tend not to contribute to allocator fragmentation. You primarily have to worry about data with unknown and variable life cycles, which is really critical for any in-memory database.

The team focused on three main types of memory usage that fit this profile.

•  Tuple storage—a logical array of fixed-size tuples per table.

•  Blob storage—a set of variable-sized binary objects linked from tuples.

•  Index storage—trees and hash tables that provide fast access to tuples by key.

Two teams set about implementing two different approaches to see which might work best.

The first team took on indexes and blob storage. The plan was to remake these data structures in such a way that they never had any “holes” at all. For indexes, all allocations for a specific index with a specific key width would be done sequentially into a linked list of memory-mapped slabs. Whenever a tree node or hash entry was deleted, the record at the very end of the set of allocations would be moved into the hole, and the pointers in the data structure would be reconfigured for the new address. Blob storage was managed similarly, but with pools for various size blobs.
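
The "no holes, ever" approach can be sketched like this (a simplified model where a Python dict stands in for the tree/hash pointers that get patched; the real structures store fixed-width nodes in memory-mapped slabs):

```python
class PackedStore:
    """Fixed-size records in one dense array: deleting record i moves
    the *last* record into slot i and patches whatever pointed at it,
    so free space is always one solid run at the end."""

    def __init__(self):
        self.records = []        # dense array of records, no holes
        self.slot_of = {}        # key -> slot; stands in for index pointers

    def insert(self, key, value):
        self.slot_of[key] = len(self.records)
        self.records.append((key, value))

    def delete(self, key):
        hole = self.slot_of.pop(key)
        last = self.records.pop()          # take the record at the very end
        if hole < len(self.records):       # unless we deleted the last one,
            self.records[hole] = last      # ...move it into the hole
            self.slot_of[last[0]] = hole   # ...and fix the "pointer" to it

s = PackedStore()
for k in "abcd":
    s.insert(k, k.upper())
s.delete("b")                # 'd' moves into b's old slot
```

The pointer fixup on every delete is the extra cost the team measured and found insignificant.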

There was a concern that the extra pointer fixups would impact performance, but measurements showed this was not significant. Now indexes and blobs could not fragment. This came at an engineering cost of several engineer-months, but without much performance impact to the product.

Tuple storage took a different approach. Tuples would be allocated into a linked list of memory-mapped slabs, much like index data, but holes from deletion would be tracked, rather than filled. Whenever the number of holes exceeded a threshold (e.g., 5%), a compaction process would be initiated that would rearrange tuples and merge blocks. This would bound fragmentation to a fixed fraction, which met the requirements of VoltDB and the customer.
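
The tuple-storage alternative, tracking holes and compacting past a threshold, can be sketched as follows (the 5% figure is the text's example; everything else is simplified):

```python
class TupleSlab:
    """Track holes from deletes; compact once they exceed a threshold."""

    THRESHOLD = 0.05   # compact when more than 5% of slots are holes

    def __init__(self):
        self.tuples = []
        self.holes = set()
        self.compactions = 0

    def insert(self, t):
        self.tuples.append(t)

    def delete(self, i):
        self.holes.add(i)               # just record the hole...
        if len(self.holes) / len(self.tuples) > self.THRESHOLD:
            self._compact()             # ...until too much space is wasted

    def _compact(self):
        # Rewrite the slab densely; fragmentation is thereby bounded
        # by the threshold instead of growing forever.
        self.tuples = [t for i, t in enumerate(self.tuples)
                       if i not in self.holes]
        self.holes.clear()
        self.compactions += 1

slab = TupleSlab()
for i in range(100):
    slab.insert(i)
for i in range(6):        # the 6th hole pushes past the 5% threshold
    slab.delete(i)
```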

In the end, we didn’t pick a winner; we used both schemes in different places. Both prototypes were sufficient and with an early product, there were many other things to improve. The anti-fragmentation work was a huge success and is considered a competitive advantage of VoltDB compared to other in-memory stores that often use memory less efficiently.5 Without it, it would be hard to use VoltDB in any production workloads.

These kinds of problems can really illustrate the gulf between research and production.

It turns out compaction is critical to running VoltDB for more than a few hours, but this never came up in the research. We had previously assumed that if a steady-state workload worked for an hour, it would work forever, but this is absolutely not the case.

Lesson. Memory usage should closely track the actual data stored, and systems should be tested for much longer periods of time.

Latency

Version 1.0 of the VoltDB database, like the H-Store prototype it was based on, used a transaction ordering and consensus scheme based on the ideas described in the original H-Store paper [Stonebraker et al. 2007b], but with additional safety. Oversimplifying a bit, nodes would collect all candidate work in a 5 ms epoch and then exchange that epoch’s work among all nodes in the cluster. This work would then be ordered based on a scheme similar to Twitter Snowflake.6
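
A Snowflake-style ordering key can be sketched as follows (the bit layout follows Twitter Snowflake's convention and is assumed here for illustration; VoltDB's actual format may have differed):

```python
def make_txn_id(epoch_ms: int, node_id: int, seq: int) -> int:
    # Timestamp in the high bits: sorting IDs numerically sorts work by
    # epoch globally. Lower bits break ties by node, then by a per-node
    # sequence counter.
    assert node_id < (1 << 10) and seq < (1 << 12)
    return (epoch_ms << 22) | (node_id << 12) | seq

# Work from an earlier 5 ms epoch orders before any later epoch,
# regardless of which node submitted it.
early = make_txn_id(epoch_ms=1000, node_id=7, seq=0)
late = make_txn_id(epoch_ms=1005, node_id=0, seq=0)
```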

This scheme guaranteed a total, global pre-order for all submitted transactions. That is, before a transaction was run, its serializable order with respect to all other transactions was known.

Compared to contemporary transaction ordering schemes, VoltDB offered more fault tolerance than two-phase commit and was dramatically simpler than using a scheme like Paxos for ordering. It also supported significantly higher throughput than either.

Having a global pre-ordering of all transactions required less coordination between cluster nodes when the work itself was being done [Stonebraker et al. 2007b]. In theory, participants have broad leeway to re-order work, so it can be executed more efficiently, provided it produces results effectively equivalent to the specified order. This was all part of the original H-Store research [Stonebraker et al. 2007b].

So, what’s the catch? This scheme used wall clocks to order transactions. That meant transactions must wait up to 5 ms for the epoch to close, plus network round trip time, plus any clock skew. In a single data center, Network Time Protocol (NTP) is capable of synchronizing clocks to about 1 ms, but that configuration isn’t trivial to get right. Network skew is also typically low but can be affected by common things like background network copies or garbage collections.

To put it more clearly, on a single-node VoltDB instance, client operations would take at least 5 ms even if they did no actual work. That means a synchronous benchmark client could do 200 trivial transactions per second, substantially slower than MySQL for most workloads.
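
The arithmetic behind that ceiling, using the figures from the text:

```python
# A synchronous client issues one transaction at a time, and each must
# wait for the 5 ms epoch to close before it can even be ordered.
epoch_s = 0.005
max_sync_tps = 1 / epoch_s             # about 200 trivial transactions/sec

# In a cluster, round trips plus NTP clock skew push the delay to
# 10-20 ms; at ~15 ms the same synchronous client manages far fewer.
cluster_delay_s = 0.015
cluster_sync_tps = 1 / cluster_delay_s
```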

In a cluster, it was worse. Getting NTP set up well in order to evaluate VoltDB was a stumbling block, especially in the new world of the cloud. This meant the delay might be 10–20 ms. The original VoltDB paper assumes achieving clock synchronization is trivial, but we found that to be just false enough to cause problems. We didn’t just need synced clocks; we needed them to stay synced for days, months, or even years without issue.

None of this affected throughput. The VoltDB client was fully asynchronous by design and could process responses in the order they arrived. A proper parallel workload could achieve millions of transactions per second on the right cluster, but asking prospective users to build fully asynchronous apps proved too much of a challenge. Users were not used to developing that way, and changing user habits is difficult.

VoltDB needed to be faster than MySQL without application wizardry.

Many months of disagreement and thought from the engineering team culminated in a small meeting where a decision had to be made.

A rough plan was hashed out to replace VoltDB consensus with a post-order system that would slash latency to near zero while keeping throughput. The new system would limit some performance improvements to cross-partition transactions (which are typically rare for VoltDB use cases) and it would require several engineers working for almost a year, time that could be spent on more visible features.

Engineering came out of that meeting resolved to fix the latency issues. As part of the plan, the VoltDB 1.0 consensus scheme would be kept, but only to bootstrap a new system of elected partition leaders that serialized all per-partition work and a single, global cross-partition serializer that determined the order of cross-partition work.

This scheme was launched with version 3.0, and average cluster latency was reduced to nearly nothing now that we did not have to hold transactions for clock skew and the all-to-all exchange. Typical response latencies were less than a millisecond with a good network.

This directly led to VoltDB use in low-latency industries like ad-tech and personalization.

Lesson. Response time is as important as throughput.

Disk Persistence

When VoltDB launched, the high-availability story was 100% redundancy through clustering. There were periodic disk snapshots, so you would see data loss only if you lost multiple nodes, and then you might only lose minutes of recent data. The argument was that servers were more reliable, and per-machine UPSs (uninterruptible power supplies) were increasingly common, so multiple failures weren't a likely occurrence.

The argument didn’t land.

VoltDB technical marketing and sales spent too much time countering the idea that VoltDB wouldn’t keep your data safe. Competitors reinforced this narrative. In early 2011, it got to the point where lack of disk persistence was severely limiting customer growth.

VoltDB needed per-transaction disk persistence without compromising the performance it was known for. Part of the original H-Store/VoltDB thesis was that logging was one of the things holding traditional RDBMSs back when they moved to memory [Harizopoulos et al. 2008], so this posed quite a challenge.

To address this problem, Engineering added an inter-snapshot log to VoltDB but broke with the ARIES (Algorithms for Recovery and Isolation Exploiting Semantics) style logs used by traditional RDBMSs. VoltDB already heavily relied on determinism and logical descriptions of operations to replicate between nodes. Engineering chose to leverage that work to write a logical log to disk that described procedure calls and SQL statements, rather than mutated data.

This approach had a huge technical advantage for VoltDB. As soon as transactions were ordered for a given partition (but before they were executed), they could be written to disk. This meant disk writes and the actual computation could be done simultaneously. As soon as both were completed, the transaction could be confirmed to the caller. Other systems performed operations and then wrote binary change-logs to disk. The logical approach and VoltDB implementation meant disk persistence didn’t have substantial impact on throughput, and only minimal impact on latency.
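The overlap between logging and execution can be sketched in a few lines of illustrative Python (a toy model of command logging, not VoltDB's actual Java internals): once a transaction is ordered, the logical log write and the deterministic execution proceed in parallel, and the caller is acknowledged only after both complete.

```python
import concurrent.futures

log = []  # stand-in for the on-disk command log


def persist(entry):
    # Append the logical log entry (procedure name + args, not mutated pages).
    log.append(entry)
    return True


def execute(procedure, args):
    # Deterministic execution; every replica produces the same result.
    return procedure(*args)


def run_transaction(seq, procedure, args):
    """Once a transaction has been ordered (seq assigned), the logical log
    write and the actual computation can run concurrently; we acknowledge
    the caller only after BOTH have finished."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=2) as pool:
        disk = pool.submit(persist, (seq, procedure.__name__, args))
        work = pool.submit(execute, procedure, args)
        result = work.result()
        disk.result()  # durability confirmed
    return result      # now safe to acknowledge the caller


print(run_transaction(1, lambda a, b: a + b, (2, 3)))  # → 5
```

A binary ARIES-style log, by contrast, can only be written after execution has produced the mutated pages, which serializes the two steps.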

Per-transaction disk-persistence was added in VoltDB 2.5 in Fall 2011 and almost immediately silenced persistence-based criticism of VoltDB. It’s clear that without this feature, VoltDB would have seen much more limited use.

As an addendum, we have a lot more data today about how common complete cluster failure is with VoltDB. Cluster failures for well-run VoltDB instances are rare, but not 100% avoidable, and not all VoltDB clusters are well run. Disk persistence is a feature that not only cut off a line of criticism, but also gets exercised by users from time to time.

Lesson. People don't trust in-memory systems as a system of record.

Latency Redux

In 2013, within a year of reducing average latency in VoltDB to nil, VoltDB was courted by a major telecommunications OEM (original equipment manufacturer) looking to replace Oracle across their stack. Oracle’s pricing made it hard for them to compete with upstart Asian vendors who had built their stacks without Oracle, and Oracle’s deployment model was poorly suited to virtualization and data-center orchestration.

Replacing Oracle would be a substantial boost to competitiveness.

During the OEM’s VoltDB evaluation, latency quickly became an issue. While average latency met requirements, long tail latency did not. For a typical call authorization application, the service level agreement might dictate that any decision not made in 50 ms can’t be billed to the customer, forcing the authorization provider to pay the call cost.

VoltDB created a new automated test to measure long tail latency. Rather than measure average latency or measure at the common 99th percentile or even the 99.999th percentile, Engineering set out to specifically count the number of transactions that took longer than 50 ms in a given window. The goal was to reduce that number to zero for a long-term run in our lab so the customer could support P99.999 latency under 50 ms in their deployments.
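The over-threshold counting approach is simple enough to show in a few lines of Python (a toy sketch, not VoltDB's actual test harness): instead of asking for a percentile, count every response that blew the SLA in a window.

```python
def count_over_threshold(latencies_ms, threshold_ms=50.0):
    """Count transactions in a window that exceeded the SLA threshold.

    Unlike a percentile, this number has a direct business meaning:
    every counted transaction is a call the authorization provider
    would have to pay for.
    """
    return sum(1 for t in latencies_ms if t > threshold_ms)


# A window of 8 response times (ms): two violate the 50 ms SLA.
window = [0.8, 1.2, 0.9, 63.0, 1.1, 0.7, 51.5, 0.9]
print(count_over_threshold(window))  # → 2
```

The engineering goal was to drive this count to zero over a long run, which is a stricter target than any fixed percentile.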

Once you start measuring the right things, the problem is mostly solved, but there was still code to write. We moved more of the statistics collection and health monitoring code out of blocking paths. We changed how objects were allocated and used to nearly eliminate the need for stop-the-world garbage collection events. We also tuned buffer sizes and Java virtual machine parameters to get everything running nice and “boring.”

If there’s one thing VoltDB Engineering learned over the course of ten years of development, it’s that customers want their operational databases to be as boring and unsurprising as possible. This was the final piece of the puzzle that closed the first major telecommunications customer, with more coming right on their heels. Today, a significant portion of the world’s mobile calls and texts are authorized through a VoltDB-based system.

Lesson. P50 is a bad measure—P99 is better—P99.999 is best.

Conclusion

Of course, the incidents described here are just a tiny sliver of the challenges and adventures we encountered building VoltDB into the mature and trusted system it is today. Building a system from a research paper, to a prototype, to a 1.0, and to a robust platform deployed around the world is an unparalleled learning experience.

1. For more on H-Store see Chapter 19: H-Store/VoltDB.

2. See https://dl.acm.org/citation.cfm?id=1454211 for the list of collaborators.

3. For more on Vertica see Chapters 18 and 27.

4. Compaction, which is critical to running VoltDB for more than a few hours, didn’t come up in the initial design or research because academics don’t always run things the way one might in production. It ended up being critical to success.

5. The competition catch-up is a long story. Most systems can’t do what VoltDB does because they use shared-memory multi-threading and even lock-free or wait-free data structures. These are much harder to compact. Other systems can use TCMalloc or JEMalloc because they don’t embed the JVM.

6. “Announcing Snowflake,” the Twitter blog, June 1, 2010. https://blog.twitter.com/engineering/en_us/a/2010/announcing-snowflake.html. Last accessed March 29, 2018.

29

The SciDB Codeline: Crossing the Chasm

Kriti Sen Sharma, Alex Poliakov, Jason Kinchen

Mike’s a database guy, so his academic innovations get spun out and branded as database products. But SciDB—in contrast to PostgreSQL, Vertica, and VoltDB—is a rather different kind of beast: one that blurs the line between a database and HPC. As a computational database for scientific applications, it’s actually two products in one tightly architected package: a distributed database and a massively parallel processing (MPP), elastically scalable analytics engine. Along with its support for a new kind of efficiently stored and accessible n-dimensional data model, the hybrid design made development triply challenging. As a lean team under intense pressure to produce revenue early on, we1 had to make many tough tradeoffs between long-term vision and short-term deliverables. Some worked out; some had to be ripped out. Here are a few tales from the trek up Mike’s SciDB mountain, in keeping with Paul Brown’s story line, “Scaling Mountains: SciDB and Scientific Data Management” (see Chapter 20).

Playing Well with Others

SciDB ships with an extensive set of native analytics capabilities including ScaLAPACK—scalable linear algebra neatly integrated into SciDB by James McQueston. But given the scope of SciDB’s mission—enabling scientists and data scientists to run their advanced array data workflows at scale—it would be impossible to provide all the functionality to cover the myriad of potential use cases. One of the core design decisions allowed users to add user-defined types, functions, and aggregates (UDTs, UDFs, and UDAs, respectively) in a fashion very similar to other databases. SciDB went one step further by supporting a much more powerful user-defined operator (UDO) abstraction. This decision proved to be successful as many user-defined extensions (UDXs) were developed by users of SciDB. Noteworthy examples include NASA’s connected-component labeling of MERRA satellite image data spatially and temporally aligned with ground-based sensor data to identify storms [Oloso et al. 2016], and a GPU-accelerated convolution running in SciDB for solar flare detection [Marcin and Csillaghy 2016].

However, the limitations of relying solely on a UDX approach became apparent as many customers already had their analytics coded in R/Python/MATLAB (MaRPy languages). They turned to SciDB to scale up their work to run the same algorithm on larger datasets or to execute an expensive computation in a shorter amount of time using elastic computing resources. But they did not want to incur the development costs to re-implement or revalidate their algorithms as UDXs. Moreover, writing UDXs required sufficient understanding about details of the SciDB architecture (e.g., tile-mode, chunking strategy, SciDB array-API, etc.). Quite often, researchers we spoke to had a “favorite package” in mind, asking us to help run exactly that package on terabytes of data. We realized that while UDXs were a powerful way to customize SciDB, the market needed to develop, iterate, and deploy faster.

In the Spring of 2016, Bryan Lewis and Alex Poliakov decided to take this up. They were under intense time pressure as the demonstration date for a prospect was looming. Thus began a three-week long collaborative programming frenzy in which Bryan and Alex first sketched out the design, divvied up the work, and produced a working implementation.

The overall architecture of SciDB streaming is similar to Hadoop streaming or Apache Spark's RDD.pipe. The implementation they chose was to use pipes: standard-in and standard-out for data transfer. The SciDB chunk—already a well-formed segment of data that was small enough to fit in memory—would be used as the unit of data transfer. Existing SciDB operators were used to move arrays between instances for "reduce" or "summarize" type workflows. The first implementation shipped with a custom R package offering an easy API to R specifically.
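The pipe-based contract is the same one Hadoop streaming popularized: the child process reads a chunk on standard-in and writes its result on standard-out. A minimal sketch of that contract (illustrative only; the real SciDB streaming protocol frames chunks in a binary format, and the worker here is a hypothetical line-oriented "UDX"):

```python
import subprocess
import sys
import textwrap

# A child "worker" that doubles every number it receives, one value per line.
worker_src = textwrap.dedent("""
    import sys
    for line in sys.stdin:
        sys.stdout.write(str(2 * int(line)) + "\\n")
""")


def stream_chunk(values):
    # The parent (database) side: feed one chunk of data to the worker's
    # stdin and collect the transformed chunk from its stdout.
    proc = subprocess.run(
        [sys.executable, "-c", worker_src],
        input="\n".join(str(v) for v in values),
        capture_output=True, text=True, check=True,
    )
    return [int(line) for line in proc.stdout.split()]


print(stream_chunk([1, 2, 3]))  # → [2, 4, 6]
```

Because the worker only sees ordinary standard-in/standard-out, it can be written in any language with any libraries, which is precisely what made the feature attractive to MaRPy users.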

Customers liked this capability immediately. One could now connect SciDB with the vast universe of open-source libraries in MaRPy languages and C++. SciDB streaming spares the user from doing the plumbing by serving up the data to each instance and orchestrating the parallelism. Specialized Python support and integration with Apache Arrow was subsequently added by SciDB customer solutions architects Jonathan Rivers and Rares Vernica. SciDB streaming is now an important part of the SciDB user’s toolkit.

As Mike likes to say, one database does not fit all. Similarly, one data distribution format does not fit all algorithms for high-performance computing. So, James McQueston is adding support to preserve data in alternate distributions—like “blockcyclic” for matrix operations and “replicated” for certain kinds of joins—to notch up performance by making broader sets of algorithms more highly efficient in a distributed system. This will boost both embarrassingly parallel (streaming) and non-embarrassingly parallel execution, such as large-scale linear algebra.
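Block-cyclic distribution, the ScaLAPACK convention mentioned above, deals fixed-size blocks of a matrix out to the participating instances round-robin. A toy sketch of the mapping, simplified to one dimension for brevity:

```python
def block_cyclic_owner(i, block_size, num_nodes):
    """Return which node owns row i under a 1-D block-cyclic layout:
    rows are grouped into blocks of `block_size`, and blocks are dealt
    to the `num_nodes` instances round-robin."""
    return (i // block_size) % num_nodes


# 12 rows, blocks of 2, dealt across 3 nodes:
# rows 0-1 -> node 0, rows 2-3 -> node 1, rows 4-5 -> node 2,
# rows 6-7 -> node 0 again, and so on.
print([block_cyclic_owner(i, 2, 3) for i in range(12)])
# → [0, 0, 1, 1, 2, 2, 0, 0, 1, 1, 2, 2]
```

The cyclic wrap is what balances the work of dense matrix operations across nodes, while "replicated" distribution instead gives every node a full copy so certain joins need no data movement at all.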

SciDB streaming is currently deployed successfully for a wearables project at a top-5 pharma company, where 7 TB of data in 63 streams from 13 different devices are time-aligned and analyzed, and for the Stanford Global Biobank Engine, to evaluate gene-level effect models on the UK Biobank genotype-phenotype association data.

You Can’t Have Everything (at Once)

When implementing an elastic, scalable, multi-dimensional, multi-attribute array MPP distributed database, the Paradigm4 team quickly realized that this was very much uncharted territory, requiring us to build many components from the ground up. In such a scenario, it was important to pick out which features to focus on first, and which capabilities to defer until later in the product roadmap. While a multidimensional array database with ACID properties had been the development focus from very early on, it was decided that full elasticity support would be designed in from the start, but rolled out in phases.

The very earliest implementations of SciDB did not fully adhere to shared-nothing architecture2 principles. User queries had to be sent to a special coordinator node rather than to any instance in a cluster. This was not ideal, and also introduced a single point of failure in the system. This initial compromise simplified the coding effort and was good enough for most of the early users who queried SciDB only via one instance.

Around the end of 2015, the implementation of true elasticity was completed. Now, all instances in a cluster could accept incoming queries—there was no single coordinator instance (even though the term “coordinator” continues to be commonly used among SciDB users). More importantly, instances could go online and offline at any moment as SciDB could detect node failures and still keep functioning properly. This required significant architectural changes which were led by Igor Tarashansky and Paul Brown.

The new implementation removed some of the exposure to the single point of failure. That architectural change also supported another important capability: ERGs, or elastic resource groups. As customer installations grew in data volume or computational needs, elasticity and ERGs allow users to add more instances to the cluster either permanently (e.g., to handle more storage), or on an as-required basis (e.g., only while running an expensive computation routine).

In business as in life, one cannot have everything at once. But with time, persistence, and a clear focus, we were able to deliver many of the product goals that we had planned early on.

In Hard Numbers We Trust

Typical queries on SciDB involve both large and small data. For example, a user might want to retrieve the gene-expression values at certain loci for patients aged 50–65 from within a large clinical trial dataset. To slice the appropriate “sub-array” from the larger gene-expression array, one must infer necessary indices from at least three other metadata arrays (corresponding to sets of studies, patients, and genes). SciDB is remarkably fast at slicing data from multi-TB arrays—after all, this is what SciDB was designed and optimized for. However, it turned out that the initial design and implementation had not focused sufficiently on optimizing query latency on “smaller” arrays.
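The shape of such a query — resolve indices from small metadata arrays first, then slice the big array — might look like this in illustrative Python (dictionaries stand in for SciDB arrays; all names and data are hypothetical):

```python
# Small metadata "arrays": patient age and locus position (made-up data).
patient_age = {0: 48, 1: 52, 2: 61, 3: 70}
locus_pos = {0: 1000, 1: 2000, 2: 3000}

# The large "array": gene-expression values keyed by (patient_id, locus_id).
expression = {(p, l): p * 10 + l for p in range(4) for l in range(3)}


def slice_expression(age_lo, age_hi, loci):
    # Step 1: resolve dimension indices from the metadata arrays.
    patients = [p for p, age in patient_age.items() if age_lo <= age <= age_hi]
    # Step 2: slice the sub-array out of the large array using those indices.
    return {(p, l): expression[p, l] for p in patients for l in loci}


sub = slice_expression(50, 65, [0, 2])  # patients aged 50-65, two loci
print(sorted(sub))  # → [(1, 0), (1, 2), (2, 0), (2, 2)]
```

Step 2 is the multi-TB slice SciDB was built for; the surprise was that Step 1, touching only tiny metadata arrays, dominated the latency of interactive queries.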

Initially, we thought this slower performance might be because small arrays that could fit on one machine were being treated the same as large arrays that had to be distributed across multiple machines. We hypothesized that an unnecessary redistribution step was slowing down the query. If that were the case, the engineering team thought a sizable rewrite of the codebase would be required to achieve a significant speedup. With immediate customer deliverables pressing, we deferred tackling this specific performance shortfall for a while.

In a separate project, James McQueston had spent time building powerful and very granular profiling tools for SciDB. His engineering mantra was that developers must look at hard numbers before devising any theory about why something goes as slow (or as fast) as it does. The profiling tool had previously been useful in certain other optimization scenarios. However, its true value shone when revealing the actual cause of the slowness on small-array queries. Dave Gosselin and James painstakingly replicated anonymized customer data and queries in the profiling environment. Armed with those tools, they discovered that the major bottleneck was not the redistribution step that everyone suspected. Instead, the delay was caused by inefficient querying of the system catalog, something that was implemented eons ago in the product’s history but had never been revisited. This fix was relatively easy.

Our initial hypothesis had been incorrect; careful performance analysis pointed us to the correct root cause. Happily, we were able to achieve significant speedup for the small-array queries without a major rewrite of the codebase.

Language Matters

Scientists and data scientists prefer statistical analysis languages like R and Python, not SQL. We knew that writing intuitive interfaces to SciDB in these languages would be important if we wanted these folks to start using our system. Thus, we developed SciDB-R and SciDB-Py as R and Python connectors to SciDB, respectively. One could use these connectors to dispatch queries to the multi-TB data on SciDB while also utilizing the programming flexibility and vast open-source libraries of their favorite programming language. Today, almost all customers who use SciDB in production use one of these interfaces to work with SciDB.

The journey of these two connector libraries reflects our evolving understanding of the user experience. Initial implementations of these connectors (SciDB-R by Bryan Lewis and SciDB-Py by Jake VanderPlas and Chris Beaumont) shared the common philosophy that we should overload the target language to compose/compile SciDB queries. For example, R's subset functionality was overridden to call the SciDB operators between and filter underneath. In the SciDB-Py package, Pythonic methods like transform and reshape were implemented that actually invoked differently named SciDB operators underneath. We thought that this kind of R-/Python-level abstraction would be preferred by users. However, when we trained customers, we found that the user experience was not ideal, especially for advanced users. Some corner cases could not be covered: the abstraction layer did not always produce the optimal SciDB query.

A complete rewrite of SciDB-R (Bryan Lewis) and SciDB-Py (Rares Vernica) was carried out recently where the interface language mapped more closely to SciDB’s internal array functional language (AFL). However, we concluded that one cannot resolve language issues in the abstraction or translation layer. Instead we are now investing the time and effort to improve the user experience by redesigning the SciDB AFL language itself. We have made significant progress on this front: Donghui Zhang unified three operators (filter, between, and cross_between) into a more user-friendly filter syntax; Mike Leibensperger introduced auto-chunking to spare the user from having to think about physical details. More language improvements are on the roadmap to improve the user experience and to align SciDB’s language more closely with R and Python.

Security is an Ongoing Process

Our first-pass on security for SciDB involved separation of arrays via named domains or “namespaces” (e.g., users A and B might have access to arrays in namespace NS_1, but only A has access to arrays in NS_2). Access was authorized locally by SciDB.
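That first-pass namespace model can be pictured as a simple mapping from namespace to the set of users granted access (a toy sketch of the A/B example above, not SciDB's actual catalog code):

```python
# Namespace -> set of users granted access (mirrors the NS_1 / NS_2 example).
namespace_acl = {
    "NS_1": {"A", "B"},
    "NS_2": {"A"},
}


def can_access(user, namespace):
    # Authorization is decided locally, from state the database itself holds.
    return user in namespace_acl.get(namespace, set())


print(can_access("B", "NS_1"), can_access("B", "NS_2"))  # → True False
```

The limits of this model are visible in the sketch: access is all-or-nothing per namespace, and the user list lives inside the database rather than in an enterprise directory.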

This implementation of security was a first pass and we knew we would have to improve it. The opportunity came when one of our pharma customers required an industry-standard authorization protocol (e.g., LDAP [Lightweight Directory Access Protocol]), and more fine-grained access control, such as different access control privileges for each user across hundreds of their clinical studies.

Two short-term security projects were spun off simultaneously. Study-level access control was implemented by Rares Vernica and Kriti Sen Sharma using SciDB's UDX capability. SciDB-R and SciDB-Py also required major plumbing changes. In tandem, integration with pluggable authentication modules (PAM, an industry standard for secure authorization control) enabled SciDB's utilization of LDAP. Mike Leibensperger rewrote the entire security implementation to achieve these new capabilities, with which we won the contract.

Despite “security is an ongoing process” being a cliché, we understand the truth of the statement. We also realize that there is more ground to cover for us in this realm and are embarked on continuous improvements.

Preparing for the (Genomic) Data Deluge

The deluge of genomic data has been long anticipated [Schatz and Langmead 2013]. With the advent of sub-$1,000 DNA sequencing, the drive toward personalized medicine, and the push for discovering and validating new therapies to deliver more effective outcomes, more and more people are getting their DNA sequenced via nationalized data collection efforts like the UK Biobank program, the Million Veteran Program, and the All of Us program. While marketing and selling SciDB, we heard claims from competitors that they were already capable of handling data at these scales. Our engineering team decided to tackle these claims head on and to produce hard data about SciDB's ability to deliver that scalability today.

Using genomics data available in the public domain, our VP of Engineering Jason Kinchen, along with consulting engineer James McQueston, generated and loaded 100K, 250K, 500K, and 1M exomes (the coding regions of the genome) into SciDB. We showed that common genomics queries scaled linearly with the size of the data. For example, given a particular set of genomics coordinates, the search for variants overlapping the regions, including long insertions and deletions, took 1 sec on a 100K exome dataset and 5 sec on a 500K dataset. The growth was linear, thanks to SciDB’s multidimensional indexing capabilities. Further, having arrived at the “500K point,” adding more SciDB instances would speed the query up. Apart from greatly enhancing the marketing message with hard data proving SciDB’s scalability, this benchmark work also pointed to opportunities for significant performance improvements, many of which have been tackled. More are planned.

By setting up test scripts with benchmarks from real-world use cases, our engineering team regression-tests and profiles each software release. This relentless focus on real-world performance and usability is fundamental to our development culture and keeps us focused on improving our customers' experience and their use cases.

Crossing the Chasm: From Early Adopters to Early Majority

Paradigm4 develops and makes available both an open-source Community Edition of the core SciDB software (available at http://forum.paradigm4.com/ under an Affero GPL license) and an Enterprise Edition (Paradigm4 license) that adds support for faster math, high availability, replication, elasticity, system admin tools, and user access controls.

While Mike set a grand vision for a scientific computational database that would break new ground providing storage, integration, and computing for n-dimensional, diverse datasets, he left many of the engineering details as an exercise for the Paradigm4 development team. It has been an ambitious and challenging climb, especially given that we are creating a new product category, requiring both technical innovations and missionary selling. Mike remains actively engaged with SciDB and Paradigm4, providing incomparable and invaluable guidance along the trek. But the critical lessons for the product and the company—new features, performance enhancements, improving the user experience—have come in response to our experiences selling and solving real customer applications.

1. The authors and their colleagues mentioned in this chapter are with Paradigm4, the company that develops and sells SciDB.

2. A shared-nothing architecture prescribes that all participating nodes of a multi-node distributed computing architecture are self-sufficient. In SciDB parlance, the unit of participation in a cluster is the “instance,” while the term “node” is reserved for one physical (or virtual) computer within a multi-computer cluster.

30

The Tamr Codeline

Nikolaus Bates-Haus

Volume, velocity, and variety: Among the three Vs of Big Data [Laney 2001], there are well-established patterns for handling volume and velocity, but not so for variety. The state of the art in dealing with data variety is manual data preparation, which is currently estimated to account for 80% of the effort of most data science projects [Press 2016]. Bringing this cost down would be a huge benefit to modern, data-driven organizations, so it is an active area of research and development. The Data Tamer project [Stonebraker et al. 2013b] introduced a pattern for handling data variety, but nobody had ever put such a system into production. We knew there would be challenges, but not what they would be, so we put together a team with experience, talent, drive, and the audacity to assume we would overcome whatever got in our way.

I joined Tamr in April 2013 as employee #3. My first tour of duty was to recruit the core engineering team and build the first release of our commercial product for data unification (see Chapter 21). My personal motivation to join Tamr was that tackling the challenge of data variety would be the culmination of a decades-long career working with data at large scale, at companies such as Thinking Machines (on the Darwin Data Mining platform), Torrent Systems (a parallel ETL [extract, transform, load] system) and Endeca (on the MDEX parallel analytic database). I’ve seen messy data foil far too many projects and, after talking to Tamr co-founders Mike Stonebraker and Andy Palmer about their aspirations, I concluded that Tamr would provide the opportunity to apply those decades of experience and thinking to fundamentally change the way we approach big data.

Because we were breaking new ground with Tamr, we faced a number of unusual challenges. The rest of this chapter looks at some of the most vexing challenges, how we overcame them, lessons learned, and some surprising opportunities that emerged as a result.

Neither Fish nor Fowl

From the start, the Tamr codeline has been a kind of chimera, combining intense technical demands with intense usability demands. The first board meeting I attended was about a week after I joined, and Andy announced that we were well on our way to building a Java replacement for the Python academic system. The system also included large amounts of PL-SQL used to do the back-end computation, and large amounts of JavaScript used for the user interface. The fact that we used SQL to express our backend computation reveals one of the earliest and most enduring principles of Tamr: We are not building a custom database. At times, this has made things exceptionally difficult: We needed to get deep into database-specific table and query optimization to get the system to perform. But it enabled us to keep our focus on expanding our functionality and to lean on other systems for things like data movement speed, high availability, and disaster recovery.

There is no list of functions and performance benchmarks for Data Unification. It doesn’t have a Transaction Processing Performance Council (TPC) benchmark where you can run the queries from the benchmark and say you’ve got a working system. The closest domains are MDM (master data management), with a history of massive budget overruns and too little value delivered, and ETL, a generic data processing toolkit with a bottomless backlog and massive consulting support. We didn’t know exactly what we wanted to be, but we knew it wasn’t either of those. We believed that our codeline needed to enable data engineers to use the machine to combine the massive quantities of tabular data inside a large enterprise (see [Stonebraker et al. 2013b] Sec. 3.2). We also believed that the dogma of traditional MDM and ETL toolkits was firmly rooted in the deterministic role of the data engineer, and that highly contextual recommenders—subject matter experts (or SMEs)—would have to be directly engaged to enable a new level of productivity in data delivery.

This raised the question of who our users would be. Data is traditionally delivered by data engineers, and patterns and semantics of data are traditionally derived by data scientists. But both data engineering and data science organizations are already overloaded with work, so putting either of those on the path to delivery would entangle delivery of results from Tamr with the IT backlog and slow things down too much. It would also hold the subject matter experts one level removed, making it dramatically harder to achieve our goals for productivity in data delivery.

This set up a long-standing tension within the company. On the one hand, much of what needs to happen in a data unification project is pretty standard data engineering: join these tables; aggregate that column; parse this string into a date, etc. Mike, in particular, consistently advocated that we have a “boxes and arrows” interface to support data engineers in defining the data engineering workflows necessary for data unification. The argument is that these interfaces are ubiquitous and familiar from existing ETL tools, and there is no need for us to innovate there.

On the other hand, the very ubiquity of this kind of tool argues strongly against building another one. Rather than building and delivering what by all appearances is yet another way to move data from system A to system B, with transformations along the way, we should focus on our core innovations—schema mapping, record matching, and classification—and leave the boxes and arrows to some other tool.

We had seen in early deployments that enormous portions of data unification projects could be distilled to a few simple activities centered around a dashboard for managing assignment and review of expert feedback. This could be managed by a data curator, a non-technical user who is able to judge the quality of data and to oversee a project to improve its quality. To keep this simple case simple, we made it easy to deploy with this pre-canned workflow. However, many projects required more complex workflows, especially as teams started to incorporate the results into critical operations. To ensure that the needs of these deployments would also be met, we built good endpoints and APIs so the core capabilities could be readily integrated into other systems. As a result, many of our projects delivered initial results in under a month, with incremental, weekly deliveries after that. This built a demand for the results within the customer organization, helping to motivate integration into standard IT infrastructure. This integration became another project deliverable along the way rather than a barrier to delivering useful results.

Lesson. Building for the users who benefit from the results is essential when we can’t just point to an existing standard and say, “Look, we’re so much better!”

Taming the Beast of Algorithmic Complexity

In the summer of 2013, we were engaged with a large information services provider to match information on organizations—corporations, non-profits, government agencies, etc.—against a master list of 35 million organizations. A given input was expected to have 1–2 million listings, for a total of 70 trillion comparisons. This is the core N² challenge of entity resolution that is as old as time for enterprise data professionals.

The broadly accepted technique to address this is blocking: identifying one or a few attributes that can be used to divide the data into non-overlapping blocks of records, then doing the N² comparisons only within each block. Blocking requires insight into the data to know which attributes to use, and high-quality data to ensure that each record ends up in the correct block.
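
The mechanics of blocking can be sketched in a few lines. This is an illustrative sketch only; the records, the zip-code blocking key, and the function names are invented for the example and are not Tamr's implementation:

```python
from collections import defaultdict
from itertools import combinations

def block_records(records, key_fn):
    """Partition records into blocks keyed by a blocking attribute."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[key_fn(rec)].append(rec)
    return blocks

def candidate_pairs(records, key_fn):
    """Compare only within each block, avoiding the full N^2 cross product."""
    for block in block_records(records, key_fn).values():
        yield from combinations(block, 2)

# Hypothetical organization listings, blocked on zip code.
records = [
    {"name": "Acme Corp", "zip": "02139"},
    {"name": "ACME Corporation", "zip": "02139"},
    {"name": "Globex Inc", "zip": "10001"},
]
pairs = list(candidate_pairs(records, key_fn=lambda r: r["zip"]))
# Only the two 02139 records are compared; Globex is never paired.
```

The weakness the chapter notes is visible here: if "ACME Corporation" had a mistyped zip code, it would land in the wrong block and the match would be missed entirely.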

We had discussed at great length our desire to build a platform that would be able to address many entity types. We already had a data-independent method of breaking the problem down, but the academic implementation didn’t work well enough at the scale of 35 million records: Queries would take days, and our customer wanted them in minutes for batch queries and in a few milliseconds for single-record queries. Even though the original system was far from being able to meet these targets—and we barely had a design that would let us meet them—we agreed to the goals and to the follow-up goal of being able to perform entity resolution at the scale of 70 million on parallel hardware within a year or so after delivering at the scale of 35 million. The customer’s requirements were entirely reasonable from a business perspective and embodied goals that we believed would be appealing to other customers. So we set out to build something better than what had been attempted in the previous systems.

Tamr co-founder George Beskales, who did much of the pioneering work on deduplication in the academic prototype, came up with a promising approach that combined techniques from information retrieval, join optimization, and machine learning. When Ihab Ilyas, co-founder and technical advisor to the company for all things machine learning, reviewed the initial definition of this approach, he identified multiple challenges that would cause it to fail in realistic scenarios. This kicked off several weeks of intense design iteration on how to subdivide the N² problem to get us to the performance required. We ultimately developed an approach that is data-independent and delivers excellent pruning for batch jobs, with which we have demonstrated scaling up to 200 million records with no sign of hitting any hard limits. It is also very amenable to indexed evaluation, which provided a foundation on which we built the low-latency, single-record match also desired by many customers.

In the winter of 2016, we were working with another large information services provider to perform author disambiguation on scientific papers. While our pairwise match worked as desired, we ran into real problems with clustering. The project was deduplicating the authors of articles published in scholarly journals, and many journals use only author first initial and last name, leading to massive connected subgraphs for authors’ names such as “C. Chen.” Since clustering is even worse than N² in the number of edges in the connected component, even on distributed hardware we weren’t able to complete clustering of a 4 million-node, 47 million-edge connected component. Again, George Beskales and Ihab Ilyas, in conjunction with field engineers/data scientists Eliot Knudsen and Claire O’Connell, spent weeks iterating on designs to convert this massive problem into something tractable. By arranging the data to support good invariants and tuning the clustering algorithm to take advantage of those invariants, we were able to derive a practical approach to clustering that is predictable, stable, distributes well, and has complexity of approximately N log(N) in the number of edges in a connected component. This let us tackle the clustering challenges for the large information services provider, as well as similar problems that had arisen with clients who were attempting to integrate customer data and master supplier data that included many subsidiaries.
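
The “C. Chen” problem arises because pairwise matches chain records into connected components that must then be clustered. A minimal sketch of the connected-components step, using union-find with path halving (the record names are invented, and Tamr's actual invariant-based clustering algorithm is not shown here):

```python
from collections import defaultdict

class UnionFind:
    """Union-find with path halving: near-linear in the number of edges."""
    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[rb] = ra

# Hypothetical pairwise-match edges; the "C. Chen" records chain together.
edges = [("C. Chen #1", "C. Chen #2"), ("C. Chen #2", "C. Chen #3"),
         ("J. Smith #1", "J. Smith #2")]
uf = UnionFind()
for a, b in edges:
    uf.union(a, b)

clusters = defaultdict(set)
for node in uf.parent:
    clusters[uf.find(node)].add(node)
# Two components: a 3-record "C. Chen" cluster and a 2-record "J. Smith" cluster.
```

Finding components is the cheap part; the expensive step the team had to tame is deciding how to split a 47 million-edge component into real-world authors, which is where the data invariants came in.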

Lesson. Pushing to meet customers’ seemingly unreasonable demands can lead to dramatic innovation breakthroughs.

Putting Users Front and Center

The original goal of the Data Tamer academic system was to learn whether machine learning could address the challenge of data variety—the third “V” in big data, after volume and velocity. The early work on the project showed that machine learning could help quite a bit, but the quality of results wasn’t good enough that enterprises would be willing to put the results into production. When we engaged with one of the academic collaborators at the Novartis Institute for Biomedical Research (NIBR) on a schema mapping project, we had an opportunity to involve subject matter experts directly in the central machine learning cycle, and this was transformative for results quality. Having subject matter experts directly review the machine learning results, and having the platform learn directly from their feedback, got us into the high 90th percentile of results quality, which was good enough for enterprises to work with. Mark Schreiber (at NIBR at the time) was a key contributor to the academic effort as well as later commercial efforts to artfully and actively integrate human subject matter expertise into our machine learning models for schema mapping.

We knew from the beginning that the product would succeed or fail based on our ability to carefully integrate human expertise through thoughtful user experience (UX) design and implementation. Building thoughtful UX is not a natural skill set for a bunch of database system and machine-learning-algorithms folks, so we set out to hire great UX and design people and set up our product development practices to keep UX front and center.

Early versions of the platform did not incorporate active learning. Customers always want to know how much SME time they will need to invest before the system will deliver high-quality results. The answer can’t be 10% coverage: In the example of mastering 35 million organizations, this would mean 3.5 million labels, which is orders of magnitude too many for humans to produce. SMEs are people with other jobs; they can’t spend all their time labeling data, especially when the answers seem obvious. We thus incorporated active learning to dramatically reduce the amount of training data the system needs. With these changes, entity mastering projects are able to deliver high-quality results with a few days of subject matter experts’ time to train an initial model, and very low ongoing subject matter expert engagement to keep the system tuned.

SMEs are very sensitive to having their time wasted. In addition to the system not asking questions with “obvious” answers, SMEs don’t want to be asked the same question, or even very similar questions, multiple times. We also need to prioritize the questions we’re asking of SMEs, and the most effective method is by “anticipated impact”: How much impact will answering this question have on the system? To calculate this, we need both an estimate of how many “similar” questions the system will be able to answer automatically and a metric for impact of each question. Very early on, we incorporated a value metric, often labeled “spend,” that the system can use to assess value. We use coarse clustering to estimate the overall impact of a question and can prioritize that way. The Tamr system provides a facility for SMEs to provide feedback on questions, and they have not been shy about using this facility to complain when they feel like the system is wasting their time. When we incorporated these changes, the rate of that kind of feedback plummeted.
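
The “anticipated impact” prioritization described above can be sketched as follows. The field names, the coarse cluster-size estimate, and the scoring function are all illustrative assumptions, not Tamr's API:

```python
def prioritize_questions(questions):
    """Rank candidate SME questions by anticipated impact: the number of
    similar questions the answer would resolve automatically (estimated by
    coarse clustering) times the value at stake ("spend")."""
    return sorted(
        questions,
        key=lambda q: q["cluster_size"] * q["spend"],
        reverse=True,
    )

# Hypothetical candidate questions for a supplier-mastering project.
questions = [
    {"id": "q1", "cluster_size": 40, "spend": 10.0},   # resolves many pairs
    {"id": "q2", "cluster_size": 5,  "spend": 500.0},  # few pairs, high value
    {"id": "q3", "cluster_size": 2,  "spend": 1.0},    # near-obvious, low value
]
ranked = prioritize_questions(questions)
# q2 scores 2500, q1 scores 400, q3 scores 2, so q3 is asked last (or never).
```

Near-obvious, low-value questions like q3 sink to the bottom of the queue, which is exactly the behavior that stops SMEs from feeling their time is being wasted.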

Data curators need visibility into the SME labeling workflow, so we built dashboards showing who is involved, how engaged they are, what outstanding work they have, etc. Some of the early feedback we received was that SMEs will not use a system if they believe it can be used against them in a performance review, e.g., if it shows how they perform relative to some notion of “ground truth.” To overcome this impediment to engagement, we worked with curators to identify ways to give them the insight they need without providing performance scores on SMEs. The result is that curators have the visibility and tools they need to keep a project on track, and SME engagement is consistently high.

Lesson. Building a system around direct interaction with non-engineers is very challenging, but it enables us to deliver on a short timeline not otherwise possible. Humans were going to be primary in our system and we needed to have human factors engineering and design on our core team.

Scaling with Respect to Variety

The academic system was built on Postgres (of course) (see Chapter 16), and the first versions of the commercial platform were also built on Postgres, taking advantage of some Postgres-specific features—like arbitrary length “text” data type—that make design relatively straightforward. But using Postgres as the backend had two disadvantages: first, most IT departments will not take responsibility for business continuity and disaster recovery (BCDR) or high availability (HA) for Postgres; and second, it limited us to the scale practical with single-core query evaluation. We knew we needed to take advantage of multi-core, and eventually of scale-out, architecture, but we also knew building a large distributed system would be hard, with many potential pitfalls and gotchas.

We therefore evaluated a few different options.

Option #1 was explicit parallelism: Since Postgres could parallelize at the connection level, we could rewrite our queries so that each partition of a parallel query was run in a separate connection. We would need to manage transactions, consistency, etc., ourselves. This would be tantamount to building a parallel RDBMS ourselves, an option that, in Mike’s words, we should pursue “over my dead body.”

Option #2 was to migrate to a platform that supported parallelism internal to a query, in theory making parallelism invisible to us at the SQL level. Systems such as Vertica and Oracle provide this. This option had the advantage that IT organizations would already be familiar with how to provide BCDR and HA for these platforms. But it also had multiple downsides: It would require customers to carry an expensive database license along with their Tamr license; it would require us to support many different databases and all their idiosyncrasies; and its longevity was questionable, as we had heard from many of our customers that they were moving away from traditional proprietary relational databases and embracing much less-expensive alternatives.

Option #3 was to embrace one of the less-expensive alternatives our customers were considering and rewrite the backend to run on a scale-out platform that didn’t carry the burden of a separate license. Impala and Spark were serious candidates on this front. The disadvantage of this option was that IT organizations probably wouldn’t know any time soon how to provide BCDR or HA for these systems, but many organizations were building data lake teams to do exactly that, so it seemed like this option would be riding a positive wave.

After a lot of intense debate, we decided to take option #2, and build plug-ability to support multiple backends, hopefully reducing the cost of eventually pursuing option #3. Our early customers already had Oracle licenses and DBAs in place, so we started there. Our initial estimate was that it would take about three months to port our backend code to run on Oracle. That estimate ended up about right for a functionally complete system, although the capabilities of the product ended up being different with an Oracle backend, and it took six more months to get the performance we expected. Once we had the Oracle port ironed out, we started prototyping a Vertica port for the backend. We quickly determined that, because the behavior of Vertica is even more different from Postgres than Oracle’s, we would get very little leverage from the Oracle port and estimated another six to nine months for the Vertica port, which was exorbitantly expensive for a small, early-stage company.

The reason the ports were so difficult lies at the core of what the Tamr platform is and does. A simplified version is that it takes customer data in whatever form, uses schema mapping to align it to a unified schema of our customer’s design, and then runs entity resolution on the data, delivering the results in that user-defined schema. To support this workflow, the backend needs to accommodate data in a large variety of schemas. For example, a customer doing clinical trial warehousing had 1,500 studies they wanted to reformat. Each study spans 30 domains, and the source for each domain averages 5 tables. Representing these in source format, in two versions of SDTM (Study Data Tabulation Model, a regulatory data model for data on clinical trials of pharmaceuticals), and in their own internal clinical trial format results in 1,500 × 30 × 8 = 360,000 tables. Another customer has 15,000 studies, for 3.6 million tables. This kind of data scale—scale in the number of tables—is not something that existing RDBMSs are designed to handle.

The academic Data Tamer system chose a particular approach to address this problem, and the early versions of the Tamr platform used the same approach. Rather than represent each logical table as a separate database table, all the data was loaded as entity, attribute, value (E, A, V) triples in a single-source table—actually, (Table, E, A, V) quads—with a second (E, A, V) table for the unified data of all tables. The platform could then define the logical tables as temporary views that first filtered the EAV table, then used a crosstab to convert from EAV to rectangular table. This way, the number of tables visible to the RDBMS was limited to the tables actually in use by running queries, keeping the total number of tables visible at any one time within the limits of what the RDBMS could handle.
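
The filter-then-crosstab view can be sketched in a few lines of code. The quads and the logical table name here are hypothetical, and a real deployment would express this as SQL views inside the RDBMS rather than in application code:

```python
from collections import defaultdict

# Hypothetical (Table, Entity, Attribute, Value) quads in the single source table.
quads = [
    ("study_001", "e1", "name", "Acme Trial"),
    ("study_001", "e1", "phase", "III"),
    ("study_001", "e2", "name", "Beta Trial"),
    ("study_001", "e2", "phase", "II"),
]

def crosstab(quads, table):
    """Rebuild one logical rectangular table from the shared quad store:
    first filter on the table name, then pivot attributes into columns."""
    rows = defaultdict(dict)
    for tbl, entity, attr, value in quads:
        if tbl == table:
            rows[entity][attr] = value
    return dict(rows)

rect = crosstab(quads, "study_001")
# rect["e1"] == {"name": "Acme Trial", "phase": "III"}; e2 likewise rectangular.
```

The trade-off the following paragraph describes is also visible in this shape: every logical table shares the same physical quad store, so concurrent workflows all contend on the same two underlying tables.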

The downfall of this approach was that all data processing workflows needed to load, update, and read from the same two tables, so, although it scales with respect to variety of input sources, it does not scale with respect to source activity. Although we could meet customer requirements for latency in propagation of changes in one workflow, meeting those requirements in multiple workflows required additional Tamr backends. This was antithetical to our goal of having a system that scales smoothly with data variety, or the number of input tables.

This motivated us to curtail our investment in supporting additional RDBMS backends and accelerate our pursuit of option #3, embracing a backend that does not have issues with hundreds of thousands or even millions of tables and that supports scale-out query evaluation. This platform is a combination of Spark for whole-data queries and HBase for indexed queries.

Figure 30.1  Particularly when architecting and building a data unification software system, technology and business strategy must evolve together. Standing, from left, are consulting software engineer John “JR” Robinson; Tamr co-founders Andy Palmer, Michael Stonebraker, and George Beskales; me (technical lead Nik Bates-Haus); and solution developer Jason Liu. Technical co-founders Alex Pagan, Daniel Bruckner, and Ihab Ilyas appear below via Google Hangout (on the screen).

Lesson. Scaling a system or platform hits limits in unexpected places; in our case, limits in the number of tables a backend can handle. We are pushing the limits of what traditional data management systems are able to do. Technology and business strategy are entangled and need to evolve together.

Conclusion

The Data Tamer system ventured out of Mike’s well-established domain of databases and into the domain of data integration. In keeping with Mike’s assertion that commercialization is the only way to validate technology, Tamr was formed to commercialize the ideas developed in the Data Tamer system. In part, customers judge how well the Tamr system works using clear metrics, such as performance and scalability, and our customers themselves are the best guides to the metrics that matter, however unreasonable those metrics might seem to us. But much of Tamr’s approach to data integration can only be assessed by more abstract measures, such as time to deliver data, or subject matter expert engagement. As we continue to work closely with Mike to realize his vision and guidance, the ultimate validation is in the vast savings our customers attribute to our projects1 and the testimonials they give to their peers, describing how what they have long known to be impossible has suddenly become possible.2

Figure 30.2  Tamr founders, employees, and their families enjoy the 2015 Tamr summer outing at Mike and Beth Stonebraker’s lake house on Lake Winnipesaukee in New Hampshire.

1.“$100+ millions of dollars of ROI that GE has already realized working with Tamr” https://www.tamr.com/case-study/tamrs-role-ges-digital-transformation-newest-investor/. Last accessed April 22, 2018.

2.“GSK employed Tamr’s probabilistic matching approach to combine data across the organization and across three different initial domains (assays, clinical trial data, and genetic data) into a single Hadoop-based data within 3 months—‘an unheard-of objective using traditional data management approaches.’” https://www.tamr.com/forbes-tamr-helping-gsk-bite-data-managementbullet/. Last accessed April 22, 2018.

31

The BigDAWG Codeline

Vijay Gadepally

Introduction

For those involved in the Intel Science and Technology Center (ISTC) for Big Data,1 releasing the prototype polystore, BigDAWG, was the culmination of many years of collaboration led by Mike Stonebraker. I joined the project as a researcher from MIT Lincoln Laboratory early in 2014 and have since helped lead the development of the BigDAWG codeline and continue to champion the concept of polystore systems.

For many involved in the development of BigDAWG, releasing the software as an open-source project in 2017 was a major step in their careers. The background behind BigDAWG—such as architecture, definitions and performance results—is given in Chapter 22. This chapter gives a behind-the-scenes look at the development of the BigDAWG codeline.

The concept of polystores and BigDAWG, in particular, has been an ambitious idea from the start. Mike’s vision of the future, discussed in his ICDE paper [Stonebraker and Çetintemel 2005], involves multiple independent and heterogeneous data stores working together, each working on those parts of the data for which they are best suited. BigDAWG is an instantiation of Mike’s vision.

Looking back at the timeline of Mike’s numerous contributions to the world of database systems, BigDAWG is one of the more recent projects. Mike’s vision and leadership were critical in all stages of the project. Mike’s vision of a polystore system was one of the large drivers behind the creation of the ISTC. Mike’s honest, straightforward communication and attitude kept the geographically distributed team moving towards a common goal. Mike’s leadership, pragmatic experience, and deep theoretical knowledge were invaluable as a group of researchers spread across MIT, University of Washington, Northwestern University, Brown University, University of Chicago, and Portland State University worked together to not only advance their own research, but also integrate their contributions into the larger BigDAWG codeline.

One of the greatest strengths of the BigDAWG project has been the contributions from a diverse set of contributors across some of the best database groups in the country. While this was helpful in developing the theory, one practical challenge was working around the geographic distance. Thus, from very early on in the project we realized that, instead of weekly telecons and Skype sessions, it would be most efficient to have major code integrations done during hackathons and sprints. To keep ourselves in line with cutting-edge research, we also made sure that these hackathons led to demonstrations and publications.

The process of development (borrowing terminology from Mike’s Turing lecture2) was:

[Image: BigDAWG development process steps]

As you can see, BigDAWG was developed in parts, with each new version a closer representation of the Mike polystore vision than the previous. The development of the codeline was unique in many ways: (1) individual components were built by different research groups, each with their own research agenda; (2) hackathons were used to bring these individual contributions into a coherent system; and (3) we worked closely with end users to create relevant demonstrations. BigDAWG today [Gadepally et al. 2017] is a software package that allows users to manage heterogeneous database management systems. The BigDAWG codeline3 is made up of middleware, connectors (shims) to databases such as Postgres and SciDB, and software that simplifies getting started with BigDAWG such as an administrative interface and scripts to simplify data loading. The middleware enables distributed query planning, optimization and execution, data migration, and monitoring. The database connectors allow users to take data in existing databases and quickly register them with the middleware so that queries can be written through the BigDAWG middleware. We also have an API that can be used to issue queries, develop new islands, and integrate new database systems. The latest news and status on the BigDAWG project can be found at http://bigdawg.mit.edu.
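To make the island idea concrete, here is a minimal Python sketch of composing queries against the middleware. The `bdrel`/`bdarray` island scoping follows BigDAWG's documented query syntax, but the helper functions and the endpoint URL in the comment are illustrative assumptions, not the project's actual client code:

```python
# Sketch: scoping queries to BigDAWG islands as plain strings.
# bdrel()/bdarray() mirror BigDAWG's island syntax; the helper names
# and the endpoint URL below are assumptions for illustration only.

def bdrel(sql: str) -> str:
    """Scope a SQL query to the relational island (e.g., Postgres)."""
    return f"bdrel({sql})"

def bdarray(afl: str) -> str:
    """Scope an AFL query to the array island (e.g., SciDB)."""
    return f"bdarray({afl})"

query = bdrel("SELECT subject_id FROM mimic2v26.d_patients LIMIT 5")
print(query)
# A client would then send the string to the BigDAWG middleware, e.g.:
#   requests.post("http://localhost:8080/bigdawg/query", data=query)
```

The point of the string-level scoping is that each island keeps its native query language; the middleware, not the user, decides where and how the scoped query actually runs.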

Figure 31.1  Timeline of BigDAWG milestones.

A few major milestones in the project are shown in Figure 31.1.

Based on external observations, the BigDAWG project has been a major success on many fronts.

1.  It has brought together some of the leading minds in database systems.

2.  It has developed a codebase that serves as a prototype implementation of the “polystore” concept.

3.  It has effectively created a new area of data management research. For example, Dr. Edmon Begoli (Chief Data Architect, Oak Ridge National Laboratory) says: “We recognized in the late 2000s, and through our work on large healthcare data problems, that the heterogeneity in data management and related data analysis is going to be a challenge for quite some time. After working with Mike, we learned about the polystore concept, and the BigDAWG work being led by his team. We got involved with early adoption, and it has become one of our major focus areas of database research.”

While the codeline is still relatively young, we are very excited about where BigDAWG is headed.

BigDAWG Origins

The first “BigDAWG” proof of concept was demonstrated during the ISTC Retreat in Portland, Oregon, in August 2014. At this time, Mike postulated that a medical application would be the perfect use-case for a polystore system. Fortunately, fellow MIT researchers Peter Szolovits and Roger Mark had developed and released a rich medical dataset called MIMIC (short for Multiparameter Intelligent Monitoring in Intensive Care). You can find more information about the dataset at Johnson et al. [2016].

So, we had a dataset but no application in mind, no middleware, or anything, really. We promised to demonstrate “BigDAWG” well before we had any idea what it would be. As I look back, the ready-fire-aim approach to development seems to be a central theme to most of the major BigDAWG developments.

We put together the first prototype of the BigDAWG system at the University of Washington (UW) by drawing heavily upon their experience building the Myria system [Halperin et al. 2014] and our work developing D4M [Gadepally et al. 2015]. Over a two-day sprint, Andrew Whittaker, Bill Howe, and I (with remote support from Mike, Sam Madden, and Jeremy Kepner of MIT) were able to give a very simple demonstration that allowed us to perform a simple medical analysis—heart rate variability—using SciDB (see Chapter 20) and Postgres/Myria. In this demonstration, patient metadata such as medications administered was stored in Postgres/Myria and SciDB was used to compute the actual heart rate variability. While this was a bare-bones implementation, this demonstration gave us the confidence that it would be possible to build a much more robust system that achieved Mike’s grander polystore vision [Stonebraker 2015c]. Funny enough, after days of careful and successful testing, the actual demo at Intel failed due to someone closing the demonstration laptop right before the demo and subsequent complications reconnecting to Intel’s Wi-Fi. However, we were able to get the demo running again and this initial demonstration, while hanging by a thread, was important in the development of polystore systems.

Figure 31.2  Screenshots for MIMIC II demonstration using BigDAWG, presented for the Intel Retreat.

First Public BigDAWG Demonstration

The initial prototype of BigDAWG at the ISTC demonstration proved, to us and others, that it was possible for East and West Coast researchers to agree on a concept and work together and that Mike’s polystore vision could be huge. However, we also realized that our initial prototype, while useful in showcasing a concept, was not really a true polystore system.

During a meeting at Intel’s Santa Clara office in January 2015, we decided to push forward, use lessons learned from the first demonstration, and develop a polystore that adhered to the tenets laid out by Mike in his ACM SIGMOD blog post [Stonebraker 2015c]. Further, since Mike wanted an end application, we also decided to focus this development around the MIMIC dataset mentioned earlier. During this January 2015 meeting, BigDAWG researchers and Intel sponsors charted out what the prototype system would look like along with an outline for a demonstration to showcase the polystore in action (we eventually demonstrated BigDAWG for medical analytics at VLDB 2015 [Elmore et al. 2015], Intel, and many other venues). The proposed demonstration would integrate nearly 30 different technologies being developed by ISTC researchers. While individual researchers were developing their own research and code, we led the effort in pulling all these great technologies together. The goal was to allow ISTC researchers to push the boundaries of their own work while still contributing to the larger polystore vision.

Figure 31.3  Hackathon pictures. (Top left) initial demonstration wireframes, (top right) initial BigDAWG architecture, (bottom left) hackathon in action, and (bottom right) demonstration in action at VLDB 2015.

We kept track of the status of the various projects via regular meetings and a very large spreadsheet. We also fixed the dataset and communicated to the various collaborators that a particular MIT cluster would be used for the software integration and demonstration. This helped us avoid some of the compatibility issues that can arise in large software integration efforts. In July 2015, we held a hackathon at MIT that brought together researchers from MIT, University of Washington, Brown, University of Chicago, Northwestern University, and Portland State University. Lots of long nights and pizza provided the fuel needed to develop the first BigDAWG codeline. By the end of this hackathon, we had our first BigDAWG codeline [Dziedzic et al. 2016], a snazzy demonstration, and a very long list of missing features.

Lesson. Integrating multiple parallel research projects can be a challenge; however, clear vision from the beginning and fixing certain parameters such as datasets and development environments can greatly simplify integration.

After a number of successful demonstrations, it was clear to a number of us that polystore systems would have their day in the sun. However, even with a successful demonstration, the underlying system still had no principled way of doing important tasks such as query optimization and data migration, and no clear query language. With the help of talented graduate students at MIT, Northwestern University, and University of Chicago (and further help from University of Washington and Brown University), over the next six months we developed these essential components.

Putting these pieces together was a complicated task that involved a number of very interesting technical challenges. One feature we wanted to include in the system was a monitoring system that could store information about queries, their plans, and related performance characteristics [Chen et al. 2016]. At the same time, Zuohao (Jack) She at Northwestern was working on a technique to develop query plans across multiple systems and determine semantic equivalences across heterogeneous systems [She et al. 2016]. Jack and Peinan Chen (MIT) worked together to develop a signature for each unique query, store the performance information of that query, and store these results for future use. Then, when a similar new query came in, they could leverage a pre-run query plan in order to execute the query across multiple systems (if a dissimilar query came in, the middleware would attempt to run as many query plans as possible to get a good understanding of performance characteristics that could be used for future queries). Another key feature was the ability to migrate data across multiple systems either explicitly or implicitly. Adam Dziedzic (University of Chicago) did a lot of the heavy lifting to make this capability a reality [She et al. 2016]. Ankush Gupta (MIT) also developed an execution engine that is skew-aware [Gupta et al. 2016]. These pieces formed the first real implementation of the BigDAWG middleware [Gadepally et al. 2016a].
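The signature-plus-monitor mechanism described above can be sketched roughly as follows. This is a hypothetical Python sketch for illustration, not the actual BigDAWG middleware (which is written in Java); the class names and literal-normalization rules are assumptions:

```python
# Hypothetical sketch of query signatures plus a performance monitor:
# structurally similar queries share a signature, so a pre-run plan can
# be reused for new queries. Not the actual BigDAWG implementation.
import hashlib
import re

def query_signature(query: str) -> str:
    """Collapse literals so structurally similar queries share a signature."""
    normalized = re.sub(r"'[^']*'", "?", query.lower())
    normalized = re.sub(r"\b\d+\b", "?", normalized)
    return hashlib.sha1(normalized.encode()).hexdigest()

class Monitor:
    """Remembers the best-performing plan seen for each query signature."""
    def __init__(self):
        self.best = {}  # signature -> (plan, runtime_seconds)

    def record(self, query, plan, runtime):
        sig = query_signature(query)
        if sig not in self.best or runtime < self.best[sig][1]:
            self.best[sig] = (plan, runtime)

    def best_plan(self, query):
        # None signals a dissimilar query: try many plans and record them.
        entry = self.best.get(query_signature(query))
        return entry[0] if entry else None

m = Monitor()
m.record("SELECT * FROM t WHERE id = 1", plan="postgres-only", runtime=2.0)
m.record("SELECT * FROM t WHERE id = 2", plan="migrate-to-scidb", runtime=0.5)
print(m.best_plan("SELECT * FROM t WHERE id = 99"))  # prints migrate-to-scidb
```

The design choice is the one described above: matching queries reuse a known-good plan, while unmatched queries are run under several plans to populate the monitor for the future.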

Lesson. A layered approach to software development can be advantageous: The first steps prove an idea and subsequent steps improve the quality of the solution.

Refining BigDAWG

The VLDB and subsequent demonstrations exhibited to the wider community that the concept of a polystore was not only feasible, but also full of potential. While the medical dataset was a great use-case, it did not show the scale at which BigDAWG could work. Thus, we set about searching for an interesting use-case that showcased all the great developments since VLDB 2015 as well as a real-world large-scale problem. After months of searching for large, heterogeneous datasets without too many sharing caveats, we were introduced to a research group at MIT led by Sallie (Penny) Chisholm, Steve Biller, and Paul Berube. The Chisholm Lab specializes in microbial oceanography and biological analysis of organisms. During research cruises around the world, the Chisholm Lab collects samples of water with the goal of understanding the ocean’s metabolism. These samples are then analyzed by a variety of means. Essentially, seawater is collected from various parts of the ocean, and then the microbes in each water sample are collected on a filter, frozen, and transported to MIT. Back in the lab, the scientists break open the cells and randomly sequence fragments of DNA from those organisms. The dataset contains billions of FASTQ-format [Cock et al. 2009] sequences along with associated metadata such as the location, date, depth, and chemical composition of the water samples. Each of these pieces is stored in disparate data sources (or flat files). This seemed like the perfect large-scale use-case for BigDAWG. Over the course of four months, we were able to refine BigDAWG and develop a set of dashboards that the Chisholm team could use to further their research. As before, the majority of the integration work was done in a hackathon hosted at MIT over the summer of 2016. With the help of Chisholm Lab researchers, we were able to use the BigDAWG system to efficiently process their large datasets. 
One of the largest challenges with this particular dataset was that, due to the volume and variety, very little work had been done in analyzing the full dataset. Putting our analytic hats on, and with significant help from Chisholm Lab researchers, we were able to develop a set of dashboards they could use to better analyze their data. By the end of this hackathon, we had integrated a number of new technologies such as S-Store [Meehan et al. 2015b], Macrobase [Bailis et al. 2017], and Tupleware [Crotty et al. 2015]. We also had a relatively stable BigDAWG codebase along with a shiny new demonstration! These results were presented at Intel and eventually formed the basis of a paper at CIDR 2017 [Mattson et al. 2017].
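For reference, FASTQ is a plain-text format that stores each sequencing read as four lines: an `@`-prefixed identifier, the base sequence, a `+` separator, and a per-base quality string. A minimal parser sketch (with made-up record contents) shows the layout:

```python
# Minimal FASTQ parser sketch: each record occupies four lines —
# @identifier, base sequence, '+' separator, per-base quality string.

def parse_fastq(text: str):
    lines = text.strip().splitlines()
    for i in range(0, len(lines) - 3, 4):
        header, seq, sep, qual = lines[i:i + 4]
        assert header.startswith("@") and sep.startswith("+")
        yield header[1:], seq, qual

# Hypothetical two-read sample, just to show the record layout.
sample = "@read1\nACGTACGT\n+\nIIIIIIII\n@read2\nTTGACCAA\n+\nFFFFFFFF\n"
for read_id, seq, qual in parse_fastq(sample):
    print(read_id, seq, len(qual))
```

At the scale described above (billions of such records), this flat-file layout is exactly why the sequence data sat outside the databases holding the cruise metadata.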

Figure 31.4  Hackathon 3 at MIT Strata Center.

Lesson. Working closely with end users is invaluable. They provide domain expertise and can help navigate tricky issues that may come up along the way.

BigDAWG Official Release

By the end of the CIDR demonstration, there were now loud requests for a formal software release. This was certainly one of the larger challenges of the overall project. While developing demonstrations and code that was to be mainly used by insiders was challenging enough, we now had to develop an implementation that could be used by outsiders! This phase had a number of goals: (1) make code that is robust and usable by outsiders; (2) automate test/build processes for BigDAWG (until now, that was handled manually by graduate students); (3) develop unit tests and regression tests; and (4) documentation, documentation, and more documentation. Fortunately, we were able to leverage the experience of MIT Lincoln Laboratory researcher Kyle O’Brien, who was knowledgeable in developing software releases. He quickly took charge of the code and ensured that the geographically distributed developers would have to answer to him before making any code changes.

We ran into a number of technical and non-technical issues getting this release ready. Just to illustrate some of the complications, I recall a case where we spent many hours wondering why data would not migrate correctly between Postgres and SciDB. Going from system A to system B worked great, as did the reverse when independently done. Finally, we realized that SciDB represents dimensions as 64-bit signed integers and Postgres allows many different datatypes. Thus, when migrating data represented by int32 dimensions, SciDB would automatically cast them to int64 integers; migrating back would lose uniqueness of IDs. There were also many instances when we regretted our choice to use Docker as a tool to simplify test, build, and code distribution. We learned the hard way that Docker, while a great lightweight virtualization tool, has many known networking issues. Since we were using Docker to launch databases, middleware, and many other components, we definitely had a number of long telecons trying to debug where errors were coming up. These telecons were so frequent that Adam Dziedzic remembers a call where someone was looking up the conference line number and Mike just rattled the conference number and access code off the top of his head.
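The round-trip hazard is easy to reproduce in a few lines. This is a sketch of the narrowing cast, not the actual migrator code: two IDs that are distinct as 64-bit integers collide once truncated back to signed 32 bits.

```python
# Sketch of the narrowing-cast hazard described above: values that are
# distinct as 64-bit integers collide after truncation to signed 32 bits.

def to_int32(value: int) -> int:
    """Truncate to a signed 32-bit integer, as a narrowing cast would."""
    value &= 0xFFFFFFFF
    return value - 0x100000000 if value >= 0x80000000 else value

id_a = 5
id_b = 5 + 2**32  # a perfectly valid 64-bit dimension value in SciDB
assert id_a != id_b                       # unique as int64...
assert to_int32(id_a) == to_int32(id_b)   # ...but not after migrating back
```

This is why the int32-to-int64 direction "worked great" in isolation: widening is lossless, and the data loss only appears on the return trip.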

Beyond technical issues, licensing open-source software can be a nightmare. After tons of paperwork, we realized about a week before our code release that one of the libraries we were using at a very core level (computing tree edit distances in the query planner) had an incompatible license with the BSD license we intended to use. Thus, overnight, we had to rewrite this component, and test and retest everything before the code release! Finally, however, the code was released almost on schedule (with documentation).

Since this first release, we’ve had a number of contributors: Katherine Yu (MIT) developed connectors with MySQL and Vertica [Yu et al. 2017]; Matthew Mucklo (MIT) developed a new UI and is building a new federation island; and John Meehan and Jiang Du at Brown University have created a streaming island with support for the open-source streaming database S-Store [Meehan et al. 2017].

Lesson. There is often a big gap between research code and production code. It is very helpful to leverage the experience of seasoned developers in making this leap.

BigDAWG Future

Compared to many of Mike’s projects, BigDAWG has a relatively young codeline. While it is currently difficult to judge the long-term impact of this project, in the short term, there are many encouraging signs. BigDAWG was selected as a finalist for the prestigious R&D 100 Award, and we have started to form a research community around the concept of polystore systems [Tan et al. 2017] as well as workshops and meetings. For example, we have organized polystore-themed workshops at IEEE BigData 2016 and 2017 and will be organizing a similar workshop (Poly’18) at VLDB 2018. We’ve used BigDAWG as the source for tutorials at conferences, and several groups are investigating BigDAWG for their work. Looking further into the future, it is difficult to predict where BigDAWG will go technically, but it is clear that it has helped inspire a new era in database systems.

1. “MIT’s ‘Big Data’ Proposal Wins National Competition to Be Newest Intel Science and Technology Center,” May 30, 2012. http://newsroom.intel.com/news-releases/mits-big-data-proposal-wins-national-competition-to-be-newest-intel-science-and-technology-center/. Last accessed March 23, 2018.

2. Stonebraker, M., The land sharks are on the squawk box, ACM Turing Award Lecture (video), Federated Computing Research Conference, June 13, 2015.

3. http://github.com/bigdawg-istc/bigdawg. Last accessed March 23, 2018.

PART VIII

PERSPECTIVES

32

IBM Relational Database Code Bases1

James Hamilton

Why Four Code Bases?

Few server manufacturers have the inclination and the resources needed to develop a relational database management system. Yet IBM has internally developed and continues to support four independent, full-featured relational database products. A production-quality RDBMS with a large customer base typically is well over a million lines of code and represents a multi-year effort of hundreds and, in some cases, thousands of engineers. These are massive undertakings requiring special skills, so I’m sometimes asked: How could IBM possibly end up with four different RDBMS systems that don’t share components?

Mike Stonebraker often refers to the multiple code base problem as one of IBM’s biggest mistakes in the database market, so it’s worth looking at how it came to be, how the portable code base evolved at IBM, and why the portable version of DB2 wasn’t ever a strong option to replace the other three.

At least while I was at IBM, there was frequent talk of developing a single RDBMS code base for all supported hardware and operating systems. The reasons this didn’t happen are at least partly social and historical, but there are also many strong technical challenges that would have made it difficult to rewind the clock and use a single code base. The diversity of the IBM hardware and operating systems would slow this effort; the deep exploitation of unique underlying platform characteristics like the single-level store on the AS/400 or the Sysplex Data Sharing on System z would make it truly challenging; the implementation languages used by many of the RDBMS code bases don’t exist on all platforms; and differences in features and functionality across the four IBM database code bases make it even less feasible. After so many years of diverse evolution and unique optimizations, releasing a single code base to rule them all would almost certainly fail to be feature- and performance-compatible with prior releases. Consequently, IBM has four different relational database management system codelines, maintained by four different engineering teams.

DB2/MVS, now called Db2 for z/OS, is a great product optimized for the z/OS operating system, supporting unique System z features such as the Sysplex Coupling Facility. Many of IBM’s most important customers still depend on this database system, and it would be truly challenging to port to another operating system such as Windows, System i, UNIX or Linux. It would be even more challenging to replace Db2 for z/OS with one of the other IBM relational code bases. Db2 for z/OS will live on for the life of the IBM mainframe and won’t likely be ported to any other platform or ever be replaced by another RDBMS codeline from within IBM.

DB2/400, now called Db2 for i, is the IBM relational database for the AS/400. This hardware platform, originally called the System/38, was released way back in 1979 but continues to be an excellent example of many modern operating system features. Now called System i, this server hosts a very advanced operating system with a single-level store where memory and disk addresses are indistinguishable and objects can transparently move between disk and memory. It’s a capability-based system where pointers, whether to disk or memory, include the security permissions needed to access the object referenced. The database on the System i exploits these system features, making Db2 for i another system-optimized and non-portable database. As with Db2 for z/OS, this code base will live on for the life of the platform and won’t likely be ported to any other platform or ever be replaced by another RDBMS codeline.

There actually is a single DB2 code base for the VM/CMS and DOS/VSE operating systems. Originally called SQL/Data System or, more commonly, SQL/DS (now officially Db2 for VSE & VM), it is the productization of the original System R research code base. Some components such as the execution engine have changed fairly substantially from System R, but most parts of the system evolved directly from the original System R code base developed at the IBM San Jose Research Center (later to become IBM Almaden Research Center). This database is not written in a widely supported or portable programming language, and recently it hasn't had the deep engineering investment of the other IBM RDBMS code bases. But it does remain in production use and continues to be fully supported. It wouldn't be a good choice to port to other IBM platforms and it would be very difficult to replace while maintaining compatibility with the previous releases in production on VM/CMS and DOS/VSE.

The Portable Code Base Emerges

For the OS/2 system, IBM wrote yet another relational database system but this time it was written in a portable language and with fewer operating system and hardware dependencies. When IBM needed a fifth RDBMS for the RS/6000, many saw porting the OS/2 DBM code base as the quickest and most efficient option. As part of this plan, in early 1992 the development of OS/2 Database Manager (also called OS/2 DBM) was transferred from the OS/2 development team to the IBM Software Solutions development lab in Toronto. The Toronto mission was both to continue supporting and enhancing OS/2 DBM and to port the code base to AIX on the RS/6000. We also went on to deliver this code base on Linux, Windows, HP/UX, and Sun Solaris.

My involvement with this project started in January 1992 shortly after we began the transfer of the OS/2 DBM code base to the Toronto lab. It was an exciting time. Not only were we going to have a portable RDBMS code base and be able to support multiple platforms but, in what was really unusual for IBM at the time, we would also support non-IBM operating systems. This really felt to me like “being in the database business” rather than being in the systems business with a great database.

However, we soon discovered that our largest customers were really struggling with OS/2 DBM and were complaining to the most senior levels at IBM. I remember having to fly into Chicago to meet with an important customer who was very upset with OS/2 Database Manager stability. As I pulled up in front of their building, a helicopter landed on the lawn with the IBM executives who had flown in from headquarters for the meeting. I knew that this was going to be a long and difficult meeting, and it certainly was.

We knew we had to get this code stable fast, but we also had made commitments to the IBM Software Solutions leadership to be in production quickly on the RS/6000. The more we learned about the code base, the more difficult the challenge looked. The code base wasn’t stable and didn’t perform well, nor did it scale well in any dimension. It became clear we either had to choose a different code base or make big changes to this one quickly.

There was a lot to be done and very little time. The pressure was mounting and we were looking at other solutions from a variety of sources when the IBM Almaden database research team jumped in. They offered to put the entire Almaden database research team on the project, with the goal of replacing the OS/2 DBM optimizer and execution engine with Starburst research database components and helping to solve scaling and stability problems we were currently experiencing in the field. Accepting a research code base is a dangerous step for any development team, but this proposal was different in that the authors would accompany the code base. Pat Selinger of IBM Almaden Research essentially convinced us that we would have a world-class optimizer and execution engine and the full-time commitment of Pat, Bruce Lindsay, Guy Lohman, C. Mohan, Hamid Pirahesh, John McPherson, Don Chamberlin, the co-inventor of the Structured Query Language, and the rest of the IBM Almaden database research team. This entire team worked shoulder to shoulder with the Toronto team to make this product successful.

The decision was made to take this path. At around the same time we were making that decision, we had just brought the database up on the RS/6000 and discovered that it was capable of only six transactions per second (TPS) measured using TPC-B. The performance leader on that platform at the time, Informix, was able to deliver 69 TPS. This was incredibly difficult news in that the new Starburst optimizer, although vital for more complex relational workloads, would have virtually no impact on the simple transactional performance of the TPC-B benchmark.

I remember feeling like quitting as I thought through where this miserable performance would put us as we made a late entrance to the UNIX database market. I dragged myself up out of my chair and walked down the hall to Janet Perna’s office. Janet was the leader of IBM Database at the time and responsible for all IBM database products on all platforms. I remember walking into Janet’s office—more or less without noticing she was already meeting with someone—and blurting out, “We have a massive problem.” She asked for the details. Janet, typical of her usual “just get it done” approach to all problems, said, “Well, we’ll just have to get it fixed then. Bring together a team of the best from Toronto and Almaden and report weekly.” Janet is an incredible leader and, without her confidence and support, I’m not sure we would have even started the project. Things just looked too bleak.

Instead of being a punishing or unrewarding "long march," the performance improvement project was one of the best experiences of my career. Over the course of the next six months, the joint Toronto/Almaden team transformed the worst-performing database management system into the best. When we published our audited TPC-B performance later that year, it was the best-performing database management system on the RISC System/6000 platform.

It was during this performance work that I really came to depend upon Bruce Lindsay. I used to joke that convincing Bruce to do anything was nearly impossible, but, once he believed it was the right thing to do, he could achieve as much by himself as any mid-sized engineering team. I've never seen a problem too big for Bruce. He's saved my butt multiple times over the years and, although I've bought him a good many beers, I still probably owe him a few more.

The ad hoc Toronto/Almaden performance team did amazing work and that early effort not only saved the product in the market but also cemented the trust between the two engineering teams. Over subsequent years, many great features were delivered and much was achieved together.

Many of the OS/2 DBM quality and scaling problems were due to a process model where all connected users ran in the same database address space. We knew that needed to change. Matt Huras, Tim Vincent, and the teams they led completely replaced the database process model to one where each database connection had its own process and each could access a large shared buffer pool. This gave us the fault isolation needed to run reliably. The team also kept the ability to run in operating system threads, and put in support for greater than 4GB addressing even though all the operating systems we were using at the time were 32-bit systems. This work was a massive improvement in database performance and stability. And, it was a breath of fresh air to have the system stabilized at key customer sites so we could focus on moving the product forward and functionally improving it with a much lower customer support burden.
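The shape of that process model can be sketched in miniature. The toy below (Python on a Unix-like system; purely illustrative, since DB2's engine is of course not built this way) gives each "connection" its own forked process for fault isolation, while all of them write pages into a single shared buffer pool:

```python
import mmap
import os

PAGE_SIZE = 4096
POOL_PAGES = 8

# One shared "buffer pool", mapped before forking so every
# connection process sees the same memory.
pool = mmap.mmap(-1, POOL_PAGES * PAGE_SIZE)

def connection_work(page_no: int, fill: int) -> None:
    """Runs inside its own process: a crash here cannot scribble over
    another connection's private address space."""
    start = page_no * PAGE_SIZE
    pool[start:start + PAGE_SIZE] = bytes([fill]) * PAGE_SIZE

for page_no, fill in [(0, 1), (1, 2)]:
    pid = os.fork()                  # one process per "connection"
    if pid == 0:
        connection_work(page_no, fill)
        os._exit(0)                  # child exits immediately
    os.waitpid(pid, 0)

# The parent observes both connections' writes in the shared pool.
first_page_byte = pool[0]
second_page_byte = pool[PAGE_SIZE]
```

The key property the sketch illustrates is that the shared mapping carries the data between processes, while each process's private heap stays isolated.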

Another problem we faced with this young code base, originally written for OS/2, was that each database table was stored in its own file. There are some downsides to this model, but it can be made to work fairly well. What was absolutely unworkable was that no table could be more than 2GB. Even back then, a database system where a table could not exceed 2GB would have been close to doomed in the Unix database market.

At this point, we were getting close to our committed delivery date. The collective Toronto and Almaden teams had fixed all the major problems with the original OS/2 DBM code base and we had it running well on both the OS/2 and AIX platforms. We also could support other operating systems and platforms fairly easily. But the one problem we just hadn’t found a way to address was the 2GB table size limit.

At the time I was lead architect for the product and felt very strongly that we needed to address the table size limitation of 2GB before we shipped. I was making that argument vociferously, but the excellent counterargument was that we were simply out of time. Any reasonable redesign would have delayed us significantly from our committed product ship dates. Estimates ranged from 9 to 12 months, and many felt bigger slips were likely if we made changes of this magnitude to the storage engine.

I still couldn’t live with the prospect of shipping a UNIX database product with this scaling limitation, so I ended up taking a long weekend and writing support for a primitive approach to supporting greater-than-2GB tables. It wasn’t a beautiful solution, but the beautiful solutions had been investigated extensively and just couldn’t be implemented quickly enough. What I did was implement a virtualization layer below the physical table manager that allowed a table to be implemented over multiple files. It wasn’t the most elegant of solutions, but it certainly was the most expedient. It left most of the storage engine unchanged and, after the files were opened, it had close to no negative impact on performance. Having this code running and able to pass our full regression test suite swung the argument the other way and we decided to remove the 2GB table size limit before shipping.
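The core of that weekend hack — a thin virtualization layer that presents one logical byte space over several physical files — can be sketched roughly like this. This is an illustrative reconstruction, not the actual DB2 storage code; the tiny `SEGMENT_BYTES` stands in for the 2GB per-file ceiling:

```python
import os
import tempfile

SEGMENT_BYTES = 2048  # stand-in for the 2GB-per-file ceiling (tiny for the demo)

class SegmentedTableFile:
    """Map one logical table byte space onto many physical files, so the
    table can grow past the per-file size limit."""

    def __init__(self, directory: str):
        self.directory = directory
        self.files = {}  # segment number -> open file handle

    def _segment(self, offset: int):
        # Translate a logical offset into (file handle, offset within file).
        seg_no, seg_off = divmod(offset, SEGMENT_BYTES)
        if seg_no not in self.files:
            path = os.path.join(self.directory, f"seg{seg_no:05d}.dat")
            self.files[seg_no] = open(path, "w+b")
        return self.files[seg_no], seg_off

    def write(self, offset: int, data: bytes) -> None:
        while data:  # split the write at segment boundaries
            f, seg_off = self._segment(offset)
            n = min(len(data), SEGMENT_BYTES - seg_off)
            f.seek(seg_off)
            f.write(data[:n])
            offset, data = offset + n, data[n:]

    def read(self, offset: int, size: int) -> bytes:
        out = b""
        while size:
            f, seg_off = self._segment(offset)
            n = min(size, SEGMENT_BYTES - seg_off)
            f.seek(seg_off)
            out += f.read(n)
            offset, size = offset + n, size - n
        return out

directory = tempfile.mkdtemp()
table = SegmentedTableFile(directory)
table.write(SEGMENT_BYTES - 4, b"spans-two-files")  # crosses a file boundary
recovered = table.read(SEGMENT_BYTES - 4, 15)
segment_count = len(os.listdir(directory))
```

As in the account above, everything above this layer is untouched: the physical table manager still sees one contiguous byte space, and once the files are open the extra `divmod` per access costs close to nothing.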

When we released the product, we had the world's fastest database on AIX measured using TPC-B. We also had the basis for a highly available system, and the customers that were previously threatening legal action became happy reference customers. Soon after, we shipped the new Starburst optimizer and query engine, further strengthening the product.

Looking Forward

This database became quite successful and I enjoyed working on it for many releases. It remains one of the best engineering experiences of my working life. The combined Toronto and Almaden teams are among the most selfless and talented groups of engineers with whom I've ever worked. Janet Perna, who headed IBM Database at the time, was a unique leader who made us all better, had incredibly high standards, and yet never was that awful boss you sometimes hear about. Matt Huras, Tim Vincent, Al Comeau, Kathy McKnight, Richard Hedges, Dale Hagen, Berni Schieffer, and the rest of the excellent Toronto DB2 team weren't afraid of a challenge and knew how to deliver systems that worked reliably for customers. Pat Selinger is an amazing leader who helped rally the world-class IBM Almaden database research team and kept all of us on the product team believing. Bruce Lindsay, C. Mohan, Guy Lohman, John McPherson, Hamid Pirahesh, Don Chamberlin, and the rest of the Almaden database research team are all phenomenal database researchers who were always willing to roll up their sleeves and do the sometimes monotonous work that seems to be about 90% of what it takes to ship high-quality production systems. For example, Pat Selinger, an IBM Fellow and inventor of the relational database cost-based optimizer, spent vast amounts of her time writing the test plan and some of the tests used to get the system stable and ready to deploy into production with confidence.

IBM continues to earn billions annually from its database offerings, so it's hard to refer to these code bases as anything other than phenomenal successes. An argument might be made that getting to a single code base could have allowed the engineering resources to be applied more efficiently. I suppose that is true, but market share is even more important than engineering efficiency. To grow market share faster, it would have been better to direct database engineering, marketing, and sales resources toward selling DB2 on non-IBM platforms earlier and with more focus. It's certainly true that Windows has long been on the DB2-supported platforms list, but IBM has always been most effective selling on its own platforms. That's still true today. DB2 is available on the leading cloud computing platform but, again, most IBM sales and engineering resources are still invested in their own competitive cloud platform. IBM platform success is always put ahead of IBM database success. With this model, IBM database success will always be tied to IBM server platform market share. Without massive platform success, there can't be database market share growth at IBM.

Figure 32.1  Many leaders from DB2 Toronto. Standing, from left to right, are Jeff Goss, Mike Winer, Sam Lightstone, Tim Vincent, and Matt Huras. Sitting, from left to right, are Dale Hagen, Berni Schiefer, Ivan Lew, Herschel Harris, and Kelly Schlamb.

1. A version of this chapter was previously published in James Hamilton’s Perspectives blog in December 2017. http://perspectives.mvdirona.com/2017/12/1187. Last accessed March 5, 2018.

33

Aurum: A Story about Research Taste

Raul Castro Fernandez

Most chapters in this section, Contributions from Building Systems, describe systems that started in the research laboratory and became the foundation for successful companies. This chapter focuses on an earlier stage in the research lifecycle: the period of uncertainty when it is still unclear whether the research ideas will make it out of the laboratory into the real world. I use as an example Aurum, a data discovery system that is part of the Data Civilizer project (see Chapter 23). I do not give a technical overview of Aurum or explain the purpose of the system—only the minimum necessary to provide some context. Rather, this is a story about research taste in the context of systems. Concretely, it's a summary of what I have learned about research taste in the two-and-a-half-plus years that I have worked with Mike Stonebraker at MIT.

Of the many research directions one can take, I focus on what I call “new systems,” that is, how to envision artifacts to solve ill-specified problems for which there is not a clear success metric. Aurum falls in this category. Within this category we can further divide the space of systems research. At one extreme, one can make up a problem, write an algorithm, try it with some synthetically generated data, and call it a system. I don’t consider this to be an interesting research philosophy and, in my experience, neither does Mike (see Chapters 10 and 11); good luck to anyone who comes into Mike’s office and suggests something along those lines. Let’s say the minimum requirement of a “new system” is that the resulting artifact is interesting to someone other than the researchers who design the system or other academic researchers in the same research community.

Research on “new systems” starts by identifying an existing problem or user pain point. The next step is usually to identify why the problem exists, and come up with a hypothesis for how to solve it. The system should help test the hypothesis in a real scenario, in such a way that if the system works well, it should alleviate the identified problem. With Aurum, we were trying to test ideas for helping organizations discover relevant data in their databases, data lakes, and cloud repositories. It turns out that “data discovery” is a very common problem in many companies that store data across many different storage systems. This hurts the productivity of employees that need access to data for their daily tasks, e.g., filling in a report, checking metrics, or finding data necessary for populating the features of a machine learning model.

So the story of Aurum started with this “data discovery” problem. The first steps involved setting up meetings with different organizations to understand how they were thinking about their data discovery problem and what they were doing to solve or avoid it. This stage is “crucial” if one cares about actually helping in a real use-case. Often, you find research papers that claim some problem area and cite another research paper. This is perfectly fine; lots of research results are built directly on top of previous research. However, many times the claims are vague and dubious, e.g., “In the era of big data, organizations need systems that can operate underwater without electricity.” Then, the researchers cite some existing paper. They probably have not talked to the people who would use the system that they are designing, but rather rely on the motivation of some previous paper to ground their contributions. It’s easy to see how this quickly gets out of hand. Citing previous contributions is OK, citing previous results is OK. Citing previous motivations should raise eyebrows, but it often does not. In any case, it turns out that if you talk to real customers, they have a ton of problems. These problems may not be the ones you expect, but they are hard enough to motivate interesting research directions.

With an initial list of requirements motivated by the problem at hand, one moves on to design a system. What then follows is an aggressive back and forth of ideas and implementations that are quickly prototyped, built, and then discarded. This is because at the beginning the requirements are vague. One must always challenge the assumptions and adjust the prototype as new requirements emerge and existing ones become more and more defined. This is remarkably difficult.

Another challenge in this first stage is that the technical requirements are intermingled with business rules or idiosyncrasies of specific organizations. Often, it helps to distill the fundamental problems by talking to many different companies. Initially I wrongly assumed that as long as Aurum had read-only access to the original data sources, it would be possible to design a system that could repeatedly read that data. It turns out that reading data only once is a very desirable property of a system that is going to access many data sources within an organization—it reduces overhead on the end systems, it reduces costs in the cloud, etc. If you cannot read the same data more than once, the way you design your data structures and how you put together the system change fundamentally. As a result, the system would be completely different. Of course, this process is far from perfect, so you typically finish a first prototype of the system and find many necessary features are missing.
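That read-once constraint shapes the design directly: every profile the system needs has to be accumulated during one streaming pass over each source, because a second scan is off the table. A minimal sketch of the pattern (illustrative only — Aurum's actual profiles and data structures differ):

```python
def profile_in_one_pass(values):
    """Accumulate several column profiles during a single scan, so the
    underlying source is read exactly once."""
    row_count = 0
    min_v = max_v = None
    distinct = set()  # exact here; a real system would use a one-pass
                      # sketch such as HyperLogLog to bound memory
    for v in values:  # the one and only pass over the data
        row_count += 1
        min_v = v if min_v is None else min(min_v, v)
        max_v = v if max_v is None else max(max_v, v)
        distinct.add(v)
    return {"rows": row_count, "min": min_v, "max": max_v,
            "distinct": len(distinct)}

# An iterator is consumable exactly once, like a source we may not rescan.
stats = profile_in_one_pass(iter([3, 1, 2, 3, 1]))
```

Note how every summary shares the same loop: adding a profile that needed its own pass would violate the read-once requirement and force a redesign.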

A lot of noise is introduced while designing the prototype as well, from simplifying assumptions to pieces that are not optimized and pose serious usability issues. On top of that, it is easy to prioritize the more interesting technical challenges instead of those that may have more of an impact for end users. This noisy process is the whole reason why it’s so important to release prototypes soon, showing them to the people who have the real problem, and receiving feedback and impressions as early as possible. This is why it’s so important to be open to challenging one’s assumptions relentlessly. Getting continuous feedback is key to steering the system in the right direction.

What then is the right direction? One may argue, if you are in research, that the right direction in computer science in the 21st century is to optimize the least publishable unit (LPU) grain. In other words, define small problems, do thorough technical work, and write lots of papers that are necessary to progress in one’s career. These days, this approach increases your chances of getting the paper published while minimizing the amount of effort that goes into the research. This, however, is generally at odds with doing impactful research. Focusing on real problems is a riskier path; just because one aims to build a system to solve a real problem does not mean the process will be successful, and it is definitely incompatible with the research community’s expectation of publishing many papers. This brings to the table two different philosophies for systems research: make it easy or make it relevant.

The “right” style is a matter of research taste. My research taste is aligned with making research relevant. This is one of the main things you learn working with Mike. The major disadvantage of making research relevant is that it is a painful process. It involves doing a lot of work you know won’t have a significant impact in the resulting research paper. It brings a handful of frustrations that have no connection with either the research or the system. It exposes you to uncountable sources of criticism, from your mentors, your collaborators, and the users of the system. When you show your prototype to the public, there are always many ruminating thoughts: Will the prototype be good enough? Will their interests be aligned or will they think this is irrelevant and disconnected? On top of those imagined frustrations and fears, there are real ones.

I still remember clearly a meeting with an analyst from a large company. We had been collaborating for a while and discussing different data discovery problems within the company. I had shown him multiple demos of Aurum, so I was confident that the interests were well aligned. After some back and forth we agreed to try Aurum on some of their internal databases. This is a painful process for industrial collaborators because they have to deal with internal challenges, such as obtaining the right permissions to the data, getting the legal team to set up appropriate agreements, and a myriad of other hurdles that I could not have imagined. This was the first real test for Aurum. When I arrived at the office, we made coffee—that is always the first step. I always have a long espresso, so that's what I brought to the meeting room. We sat at our desks and started the job right away; we wanted to minimize the deployment time to focus on what we would do next. I had pre-loaded two public databases, which I had access to, so the only remaining bit was to include the internal database. I started the instance, connected to the database with the credentials they gave me, and fired up the process. A couple of minutes into our meeting, Aurum was already reading data from the internal database at full speed. We started chatting about some ideas, discussing other interesting technologies while enjoying our coffee. I had barely taken a sip of my espresso when I looked at the screen and saw that something was obviously very wrong. A variable that should have been in the single-digit millions was in the tens of millions and growing!

Previously, I had tested Aurum using large datasets that I found in the field, under different test scenarios and using more complex queries than I expected to find in the company. However, I had overlooked a basic parameter: the vocabulary size. Once I realized this, I knew the deployment was poised to break. The gist of the problem was that Aurum was building an internal vector proportional to the size of the vocabulary. As long as the vector fit in memory, there was no problem. Although I had tried Aurum with large datasets, I did not account for the vocabulary size. The database that we were processing had tens of millions of different domain-specific terms. Ten minutes into the meeting, the process failed. The analyst, very graciously, proposed to rerun the process, but knowing the internal issue, I said that it would not help. There was something fundamental that I would need to change. The feeling of having wasted everybody’s time and resources was discouraging.
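
The failure mode described above is easy to state in code: any structure with one entry per vocabulary term has a footprint that grows linearly with the vocabulary. The sketch below is purely illustrative; the function names, the per-term cost, and the memory budget are assumptions for the example, not Aurum's actual figures.

```python
# Hypothetical sketch of a per-term in-memory structure. The ~100-byte
# cost per term (string plus bookkeeping) and the 2 GiB budget are
# illustrative assumptions, not Aurum's actual numbers.

BYTES_PER_TERM = 100  # assumed cost of one vocabulary entry

def structure_bytes(vocabulary_size: int) -> int:
    """Memory that grows linearly with the number of distinct terms."""
    return vocabulary_size * BYTES_PER_TERM

def fits_in_memory(vocabulary_size: int, budget_bytes: int = 2 * 1024**3) -> bool:
    """True if the per-term structure stays within the assumed budget."""
    return structure_bytes(vocabulary_size) <= budget_bytes

# Single-digit millions of terms, like the pre-loaded test databases: fits.
print(fits_in_memory(5_000_000))   # True
# Tens of millions of domain-specific terms: the same design breaks.
print(fits_in_memory(50_000_000))  # False
```

Under these assumptions the design works for millions of terms but crosses the budget at tens of millions, which is why the deployment only failed on the customer's domain-specific database.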

When you see a project fail in front of your eyes, a lot of questions come to mind: Did I waste this opportunity? Am I going to be able to try this again? Even if it had worked, would it have helped the research? The only way forward, though, is to get back to the office and power through the storm. Building research systems is arduous. There is a constant pressure to understand whether what you are working on is contributing to the research question you are trying to solve. There is a voice that keeps asking: Is somebody going to care about this anyway? It is a painful experience to demonstrate prototypes; naturally people focus on the things that do not work as expected—such things may not have been on your radar, but are the important points that should be handled. You receive all kinds of feedback, the type you are interested in and other types that may not be immediately relevant, but that people will deliver, trying to be helpful. On top of these frustrations, there are other kinds of external pressures that add to the mix. Today, if you are in research, building systems and taking the time to make them relevant is not necessarily aligned with an academic value system that tends to value higher paper counts (see Chapter 11). This makes you continually challenge your decisions, and wonder whether the pain is necessary.

Mike’s unperturbed belief that “make it relevant” is the only way forward has helped me stay on track despite setbacks. The story above has a happy ending. We redeployed the system a few months later and loaded the database successfully. We learned a lot during the process and, more importantly, we made the research relevant to a real problem. Making research relevant is not easy, but the pain is worth it. Hearing a new customer echoing the problems you’ve been dealing with before, and noting how quickly you can empathize and help, is satisfying. Creating something with the potential of having an impact beyond the research lab is even more satisfying. Ultimately, it all boils down to research taste: make it easy or make it relevant. I’ve chosen my path.

34

Nice: Or What It Was Like to Be Mike’s Student

Marti Hearst

There are three people who were pivotal to my success as a researcher, and Mike Stonebraker was the first of these—and also the tallest! I am very pleased to be able to share in this tribute to Mike, from my perspective as one of his former students.

Mike truly is a visionary. He not only led the way in the systems end of databases, but he was also always trying to bring other fields into DBMSs. He tried to get AI and databases to work together (OK, that wasn’t the most successful effort, but it turned into the field of database triggers, which was enormously successful). He led early efforts to bring economic methods to DBMSs, and was a pioneer in the area of user interfaces and databases. I remember him lamenting back around 1993 that the database conferences would not accept papers on user interfaces when the work on Tioga [Stonebraker et al. 1993b, 1993c] was spurned by that community. That work eventually led to the Informix visualization interface, which again was ahead of its time. Shortly after that, Oracle had more than 100 people working in its visualization group.

Not only is Mike a visionary, but he is also inspirational to those around him. Back around 1989, when I was a graduate student who didn’t find databases interesting enough for my dissertation and instead wanted to work on natural language processing (NLP), Mike said to me: “If you’re going to work with text, think BIG text.” I took that on as a challenge, despite the strange looks I got from my NLP colleagues, and as a result, was one of the first people to do computational linguistics work on large text corpora. That approach now dominates in the field, but was nearly unheard of at the time.

Another time, after I’d obtained significant research results with big text, Mike said something to the effect of, “What we need are keyboardless interfaces for text—why don’t you solve that?” This led me to start thinking about visualization for interfaces for text search, which in turn led to several inventions for which I am now well known, and eventually to my writing the very first academic book on the topic, Search User Interfaces [Hearst 2009]. Again, it was Mike’s vision and his special form of encouragement that led me down that path.

Mike also taught me that a world-famous professor wasn’t too important to help a student with annoying logistical problems that were blocking research. In 1989, it was very difficult to get a large text collection online. I still remember Mike helping me download a few dozen Sacramento Bee articles from some archaic system in some musty room in the campus library, and paying the $200 to allow this to happen.

I first met Mike when I was a UC Berkeley undergraduate who wandered into his seminar on next-generation databases. I needed a senior honors project, and even though he had just met me and I hadn’t taken the databases course, Mike immediately suggested I work with him on a research project. He was the first and only CS professor I’d encountered who simply assumed that I was smart and capable. In retrospect, I think that Mike’s attitude toward me is what made it possible for me to believe that I could be a CS Ph.D. student. So even though I suspect that sometimes Mike comes across as brusque or intimidating to others, toward students, he is unfailingly supportive.

As further evidence of this, in terms of number of female Ph.D. students advised and graduated from the UC Berkeley CS department, in 1995 Mike was tied for first with eight female Ph.D.s graduated (and he’d have been first if I’d stuck with databases rather than switching to NLP). I don’t think this is because Mike had an explicit desire to mentor female students, but rather that he simply supported people who were interested in databases, and helped them be the very best they could be, whoever they were and whatever skills they brought with them. As a result, he helped elevate many people to a level they would never have reached without him, and I speak from direct experience.

Mike’s enormous generosity is reflected in other ways. I still remember that when he converted the research project Postgres into a company (Illustra Corporation), he made a big effort to ensure that every person who had contributed code to the Postgres project received some shares in the company before it went public, even though he had no obligation whatsoever to do that. Although a few of us who contributed small amounts of code were overlooked until almost the very end, he insisted that the paperwork be modified right before the IPO so that a few more people would get shares. I find it hard to imagine anyone else who would do that, but Mike is extraordinarily fair and generous.

Figure 34.1  Logo from the Postgres’95 t-shirt.

Mike is also very generous with assigning research credit. The aforementioned undergraduate research thesis had to do with the Postgres rules system. After I joined his team as a graduate student in 1987, Mike wrote up a couple of thought pieces on the topic [Stonebraker and Hearst 1988, Stonebraker et al. 1989] and insisted on including my name on the papers even though I don’t believe I added anything substantive.

Mike’s projects were very much in the tradition of the UC Berkeley Computer Science Department’s research teams, consisting of many graduate students, some postdoctoral researchers, and some programming staff. Mike fostered a feeling of community, with logos and T-shirts for each project (see Figure 34.1 for one example) and an annual party at his house in the Berkeley hills at which he gave out goofy gifts. Many of his former students and staff stay in touch to various degrees, and, as is common in graduate school, many romantic relationships blossomed into eventual marriages.

So that’s Mike Stonebraker in a nutshell: visionary, inspirational, egalitarian, and generous. But surely you are thinking: “Hey, that can’t be the whole story! Wasn’t Mike kind of scary as a research advisor?” Well, OK, the answer is yes.

I still remember the time when I was waffling around, unable to decide on a thesis topic with my new advisor, and so I made an appointment to talk with my now former advisor Mike. For some reason, we’d scheduled it on a Saturday, and it was pouring outside. I still remember Mike appearing at the office and looking down at me from his enormous height and basically saying something like, “What’s wrong with you? Just pick a topic and do it!” From that day on, I was just fine and had no problem doing research. I have found that for most Ph.D. students, there is one point in their program where they need this “just do it” speech; I’d otherwise never have had the guts to give it to students without seeing how well this speech worked on me.

I also remember the extreme stances Mike would take about ideas—mainly that they were terrible. For instance, I was there for the ODBMS wars. I remember Mike stating with great confidence that object-oriented was just not going to cut it with databases: that you needed this hybrid object-relational thing instead. He had a quad chart to prove it. Well, he hadn’t been right about putting expert systems into databases, but he certainly ended up being right about this object-relational thing (see Chapter 6).

As with many great intellects, Mike very much wants people to push back on his ideas to help everyone arrive at the best understanding. I remember several occasions in which Mike would flatly state, “I was utterly and completely wrong about that.” This is such a great lesson for graduate students. It shows them that they have the opportunity to be the one to change the views of the important professor, even if those views are strongly held. And that of course is a metaphor for being able to change the views of the entire research community, and by extension, the world (see Chapter 3).

As I mentioned, Mike is a man of few words, at least over email. This made it easy to tell when you’d done something really, truly great. Those of you who’ve worked with him know that treasured response that only the very best ideas or events can draw out of Mike. You’d send him an email and what you’d see back would be, on that very rare occasion, the ultimate compliment:

neat.

/mike

35

Michael Stonebraker: Competitor, Collaborator, Friend

Don Haderle

I became acquainted with Mike in the 1970s through his work on database technology and came to know him personally in the 1990s. This is the perspective of a competitor, a collaborator, and a friend. Mike is a rare individual who has made his mark equally as an academic and an entrepreneur. Moreover, he stands out as someone who’s always curious, adventurous, and fun.

What follows are my recollections of Mike from the early days of the database management industry through today.

After IBM researcher Ted Codd proposed the relational model of data in 1969 [Codd 1970], several academic and industry research laboratories launched projects that created language and supporting technologies, including transaction management and analytics. IBM’s System R [Astrahan et al. 1976] and UC Berkeley’s Ingres [Held et al. 1975] emerged as the two most influential projects (see Chapter 13). By the mid-1970s, both efforts produced concrete prototypes. By the early 1980s, various relational database management systems had reached the commercial market.

In the early 1970s, I was an IBM product developer focusing on real-time operating systems and process control, file systems, point-of-sales systems, security systems, and more. In 1976, I joined a product development team in IBM that was exploring new database technology that responded to intense customer demands to dramatically speed up the time it took them to provide solutions for their fast-moving business requirements. This was my baptism in database and my first acquaintance with Mike. I devoured his writings on nascent database technology.

In the 1970s, Mike and the Ingres team developed seminal work in concurrency control [Stonebraker 1978, 1979b] indexing, security [Stonebraker and Rubinstein 1976], database language, and query optimization for the distributed relational database Ingres [Stonebraker et al. 1976b]. Ingres targeted DEC minicomputers, combining a set of those machines to address large database operations. By contrast, IBM System R targeted mainframes, which had adequate power for most enterprise work of the era; it was this research project that formed the basis of IBM’s DB2 [Saracco and Haderle 2013] mainframe database.

In the early 1980s, Mike formed Relational Technology, Inc. with Larry Rowe and Gene Wong, and delivered a commercial version of Ingres to the market. This made Mike our indirect competitor (Ingres addressed a different market than DB2 and competed directly against Oracle). Mike remained unchanged. He shared what he learned and was a willing collaborator.

Mike impressed me not only as an academic but also as a commercial entrepreneur. The trio of Mike, Larry, and Gene had to learn how to create and operate a business while still maintaining their professorial positions at UC Berkeley. This was no small feat, struggling through financing, staffing, and day-to-day operations while still advancing technology and publishing at the university. They pioneered open source through Berkeley Software Distribution (BSD) licensing of the university Ingres code, which overcame the commercial restrictions of the university licensing arrangement of the time. They came to the conclusion that the value was in their experiences, not in the Ingres code itself. This enlightenment was novel at the time and paved the way for widespread use of open source in the industry (see Chapter 12).

The trio made some rookie missteps in their commercial endeavor. Mike and Gene had developed the QUEL language for relational operations as part of Ingres while IBM had developed SQL for System R [Chamberlin et al. 1976]. There were academic debates on which was better. In 1982, serious work began by the American National Standards Institute (ANSI) to standardize a relational database language. SQL was proposed and strongly supported by IBM and Oracle. Mike did not submit QUEL, rationalizing that putting it in the hands of a standards committee would limit the Ingres team’s ability to innovate. While that was a reasonable academic decision, it was not a good commercial decision. By 1986, the industry standardized on SQL, making it a requirement for bidding on relational database contracts with most enterprises and governments around the world. As a result, Ingres had to quickly support SQL or lose out to Oracle, their primary competitor. At first the Ingres team emulated SQL atop the native QUEL database but with undesirable results. The true SQL version required major reengineering and debuted in the early 1990s. This misstep cost the team five-plus years in the market.

Beyond the database, the Ingres team developed fantastic tools for creating databases and applications using those databases (see Chapter 15). They recognized that a database needed an ecosystem to be successful in the market. Oracle created popular applications (business financials) for their database, responding to a changing market wherein customers wanted to buy generic business applications rather than build the applications themselves. Unfortunately, Mike and the team couldn’t convince their investors to fund the development of applications for Ingres, and they had a difficult time convincing application vendors to support their nascent database, especially since the team did not support the standard interface of SQL. The trio had more to learn to succeed commercially.

Commerce rapidly adopted relational databases and all manner of information technology in the early 1980s, digitizing their businesses. With this came demand to include new types of data beyond the tabular bookkeeping data supported by the first-generation relational databases. In the mid-1980s, object-oriented databases appeared to take on this challenge. Researchers explored ways to extend the relational data model to manage and operate on new data types (e.g., time series, spatial, and multimedia data). Chief among such researchers was Mike, who launched the Postgres project at UC Berkeley to explore new and novel ways to extend Ingres to solve more problems. (IBM Research did similarly with the Starburst project [Haas et al. 1989]). Indeed, a presentation [Stonebraker 1986c] delivered by Larry Rowe and Mike in 1986 at an object-oriented conference in Asilomar, CA, inspired the rest of us in the relational community to step it up.

Mike led the way in defining the Object-Relational model, drove the early innovations, and brought it to market. In the early 1990s, Mike started his second company, Illustra Corporation, to commercialize Postgres object-relational. Illustra1 offered superb tools with their “data blades” for creating data types, building functions on these types, and specifying the use of storage methods created by third parties for those objects that would yield great performance over and above the storage methods provided by the base Illustra server. Once again, Mike’s company demonstrated that a good database needs a great ecosystem of tools and applications. This technology would extend the relational database to handle geophysical data, multimedia, and more.

When Mike created Postgres in 1986, QUEL was the database language to which he added extensibility operators (PostQuel). SQL was added to Postgres in the mid-1990s, creating PostgreSQL, which has since become one of the most popular databases on the planet.2 The Illustra team had to re-engineer their database to provide native SQL support. At the same time, SQL was being extended in ANSI for object-relational. This would manifest in standard language extensions to SQL in 1999 (SQL3), covering object abstraction (data types and methods), and not covering any implementation extensions (e.g., storage methods). The Illustra team didn’t participate in the standardization committees, which set them back a bit to reconcile the different models. Illustra had the best technology, database, and tools for this market, but skipped a couple of steps to conform to SQL.

The rest of us would lag behind another couple of years, focusing our energies on the evolving distributed computing complex and the insatiable demands for performance and availability for transaction and analytics (parallelism). In the relational arena, Tandem set the bar for highly available transaction processing, and Teradata set the bar for high-performance query processing. By the late 1980s, thanks to Moore’s Law, the minicomputers of the 1970s were growing powerful enough to be combined to do the work of a mainframe, only increasingly cheaper. To compete on cost, the mainframe was reengineered to use CMOS, the technology underlying the minis, resulting in IBM’s Parallel Sysplex [IBM 1997, Josten et al. 1997], a cluster of IBM mainframes acting together as a single-system image delivered to market in 1995.

The client-server architecture emerged on the back of the growing popularity and capabilities of personal computers and workstations in business environments. Sybase debuted in the late 1980s, pioneering client-server. With the evolution of Sun and other network computers, enterprise architectures evolved from single-tier to multi-tier computing. Distributed computing had arrived. We harvested the work of IBM’s System R* [Lindsay 1987], Mike’s Ingres, and David DeWitt’s Gamma [DeWitt et al. 1990] to deliver distributed database technology for transaction processing and highly parallelized query processing. In the early 1990s, with a change in IBM leadership, we were funded to construct DB2 on open systems (UNIX, etc.) on popular hardware platforms (IBM, HP, Sun). The combination of open systems, distributed database, and massive parallelism would occupy most of our development energies through the mid-1990s, along with mobile and embedded databases.

The volume of data and number of databases within enterprises grew rapidly through the 1980s across the spectrum of database vendors. Customers separated their transactional systems from their analytical systems to better manage performance. And they discovered that they needed to analyze data across multiple data sources. This gave rise to data warehousing, which extracted data from multiple sources, curated it, and stored it in a database designed and tuned for analytics. An alternative architecture, federation, was proposed, which allowed for analytics across disparate data sources without copying the data into a separate store. This architecture was well suited for huge data, where copying was a prohibitive cost, as well as near real-time requirements. IBM Research’s Garlic project [Josifovski 2002] and Mike’s Mariposa project [Stonebraker et al. 1994a] explored this architecture, spurring technology in semantic integration and performance in queries on disparate databases by taking advantage of the unique performance characteristics of the underlying stores. Mariposa became the basis for Cohera Corporation’s3 application system and was later incorporated in PeopleSoft in the early 1990s. Garlic was incorporated in DB2 as the Federated Server in 2002. Neither reached widespread popularity because of the complexity in managing heterogeneous, federated topologies. As of 2018, we’re seeing the evolution of multi-model databases and Mike’s polystore (see Chapter 22), which draw on the federated technology and the modeling capabilities of object-relational to integrate a set of data models (relational, graph, key value) while providing best-of-breed capability for the individual data model—a bit of a snap back to OSFA (One Size Fits All) [Stonebraker and Çetintemel 2005].

In the late 1990s, the executive team for database management within IBM viewed the company as too inward-focused. We needed external perspective on technology directions as well as business directions and to assess ourselves against the best industry practices. I asked Mike to present an external perspective on technology to the IBM database management product executives, led by Janet Perna. Although he was competing with IBM at the time, Mike agreed. And he did a stellar job. His message was clear: “You need to step it up and look beyond IBM platforms.” And it had the intended effect. We stepped it up.

In 2005 I retired from IBM. I worked as little as possible, consulting with venture capitalists and startups. Mike started Vertica Systems, Inc., the column-store database based on C-Store [Stonebraker et al. 2005a]. He asked me to present the Vertica technology to prospective customers on the West Coast so he could focus more on the development teams on the East Coast and the customers in that geography. I was impressed with C-Store and Vertica for dramatically improving the performance of analytical systems (see Chapter 18). I agreed. And I worked as little as possible. Vertica was sold to HP in 2011.

In 2015 Mike received the Turing Award and spoke at IBM on analytics. Mike was working on SciDB (see Chapter 20) at the time and he was not enamored of Apache Spark, the analytical framework that IBM was pushing in the market. I was asked to attend the talk along with a few other retired IBM Fellows. Mike asked if one of us would defend Spark. He wanted a lively discussion and needed someone to provide counterpoint. I agreed. It was fun. That was Mike. He won. Then we went out for a drink.

Mike and I orbited the database universe on different paths. Mike was an innovator-academic who created commercial products, whereas I created commercial products and did some innovation [Mohan et al. 1992]. They sound alike, but they’re not. We shared our knowledge of customer needs from our different perspectives and ideas on technology to serve them better. And, as I said, Mike was a competitor, a collaborator, and always a friend.

1. Illustra was acquired by Informix in 1997, which was in turn acquired by IBM.

2. In my view, Mike’s most significant commercial achievement was perhaps unintentional: the development and open sourcing of Postgres. PostgreSQL is one of the most popular database management systems on the planet and paved the way for the open-source movement.

3. Founded in 1997 and acquired by PeopleSoft in 2001.

36

The Changing of the Database Guard

Michael L. Brodie

You can be right there at a historic moment and yet not see its significance for decades. I now recount one such time when the leadership of the database community and its core technology began to change fundamentally.

Dinner with the Database Cognoscenti

After spending the summer of 1972 with Ted Codd at IBM’s San Jose Research Lab, Dennis Tsichritzis, a rising database star, returned to the University of Toronto to declare to Phil Bernstein and myself that we would start Ph.Ds. on relational databases under his supervision. What could possibly go wrong?

In May 1974, I went with Dennis to the ACM SIGFIDET (Special Interest Group on File Description and Translation) conference in Ann Arbor, Michigan, my first international conference, for the Great Relational-CODASYL Debate, where Dennis would fight for the good guys. After the short drive from Toronto, we went to a “strategy session” dinner for the next day’s debate. Dinner, at the Cracker Barrel Restaurant in the conference hotel, included the current and future database cognoscenti and me (a database know-nothing). It started inauspiciously with Cracker Barrel’s signature neon-orange cheese dip with grissini (‘scuse me, breadsticks).

I was quiet in the presence of the cognoscenti—Ted Codd, Chris Date of IBM UK Lab, and Dennis—and this tall, enigmatic, and wonderfully confident guy, Mike something, a new UC Berkeley assistant professor and recent University of Michigan Ph.D. According to him, he had just solved the database security problem with QUEL, his contrarian query language. During dinner, Mike sketched a few visionary ideas. This was further evidence for me to be quiet since I could barely spell datakbase [sic].

The Great Relational-CODASYL Debate

The cognoscenti who lined up for the debate were, on the relational side, Ted Codd, Dennis Tsichritzis, and Kevin Whitney, from General Motors, who had implemented RDMS [Kevin and Whitney 1974], one of the first RDBMSs. On the CODASYL side were Charlie Bachman, who was awarded the 1973 Turing Award “for his outstanding contributions to database technology”; J. R. Lucking, International Computers Limited, UK; and Ed Sibley, University of Maryland and National Bureau of Standards.

The much-ballyhooed debate was less than three years after Codd’s landmark paper [Codd 1970]; one year after Charlie’s Turing Award; one year into Mike’s and Eugene Wong’s pioneering Ingres project (see Chapter 15) at UC Berkeley; coincident with the beginning of the System R project (see Chapter 35) at IBM Research, San Jose; five years before the release of Oracle, the first commercial RDBMS, in 1979, followed in 1983 by IBM’s DB2 (see Chapter 32); and almost a decade before Ted was awarded the 1981 Turing Award “for his fundamental and continuing contributions to the theory and practice of database management systems,” specifically relational databases.

SIGFIDET 1974 focused largely on record-oriented hierarchical and network databases. Relational technology was just emerging. Most significantly, SEQUEL (now SQL) was introduced [Chamberlin and Boyce 1974]. Three papers discussed concepts and six¹ RDBMS implementations: IBM Research’s XRM-An Extended (N-ary) Relational Memory, The Peterlee IS/1 System, and Rendezvous; Whitney’s RDMS; and ADMINS and the MacAIMS Data Management System. Mike’s paper [Stonebraker 1974b] on a core relational concept, like those of Codd, Date, and Whitney, showed a succinct and deep understanding of the new relational concepts, in contrast to the debate.

The much-anticipated debate was highly energized yet, in hindsight, pretty ho-hum, more like a tutorial as people grappled with new relational ideas that were so different from those prevalent at the time. The 23-page debate transcript [SIGFIDET 1974] should be fascinating to current database folks given the emergent state of database technology and the subsequent relational vs. CODASYL history. Ted, some IBMers, Whitney, Mike, and about five others were the only people in the crowded room who had any RDBMS implementation experience. Of that number, only Ted and Kevin Whitney spoke in the debate. Everyone else was living in a different world. From the transcript, Mike seemed curiously quiet.² Truth was, he had his hand up the whole time but was never called upon.³

In hindsight, most questions/comments seem weird. “Why were ownerless sets better than navigating data?” “Why is the network model worse than the relational model as a target for English?” “I couldn’t find many examples of the relational sublanguage compared to CODASYL subschemas.” “I can think of many systems that I have had in which questions would come up so that it was almost impossible, and certainly impractical, to automate a way of coming up with the answer. To me, it boils down to a question of economics. Is it worth spending the money and taking the time to be able to provide this kind of availability to anybody?” In contrast, Ted’s clear focus was on “applications programming, support of non-programmers, … and implementation” and on the logical and physical data independence that remain the cornerstones of the relational model [Codd 1970, Date and Codd 1975], emphasized succinctly by Mike [Stonebraker 1974b] and in sharp contrast to the network approach and most of what was said in the debate. The relational side was casting pearls [Matthew 7:6].

For all the fireworks projected for the debate, it was bland. So, my mind wandered to J.R. Lucking, who smoked a cigarette throughout. It was, after all, 1974. Why pay attention? It was distracting. Smoke never came out. We imagined that J.R. periodically left the room to empty an otherwise hollow leg of smoke and ash.

The debate had little impact outside the room. The real debate was resolved in the marketplace in the mid-1980s after the much-doubted adoption of Oracle and DB2⁴ and as SQL became, as Mike called it, “intergalactic data speak.” The elegance of Codd’s model would never have succeeded had it not been for RDBMS performance due to Pat Selinger’s query optimization, enabled by Ted’s logical and physical data independence,⁵ plus tens of thousands of development hours spent on query and performance optimization.

The debate and conference had a huge impact … on me. Ted Codd became a mentor and friend, calling me almost daily throughout the Falklands War to review the day’s efforts of Britain’s Royal Air Force (RAF).⁶ Charlie, who lived up the street from me in Lexington, MA, later offered me a CTO job. I declined but gained sartorial knowledge about buttons that I didn’t know I had. Ed Sibley, my first academic boss at the University of Maryland, got me appointed chair of the ANSI/SPARC (American National Standards Institute, Standards Planning and Requirements Committee) Relational Standards Committee, where I proposed, with other academics, to standardize the relational calculus and algebra, to allow multiple syntaxes, e.g., SQL, QUEL, and QBE. I lost that job to an IBMer who came with a 200-page SQL specification. (Who knew that standards were a business and not a technical thing? Nobody tells me anything.)

While the debate had little impact on the community at the time, it marked the changing of the guard from the leaders of the hierarchical and network period of database research and product development. In the debate, they had posed the odd questions, presumably trying to understand the new ideas relative to what they knew best. The torch was being passed to those who would lead the relational period that is still going strong almost half a century later. As the 1981, 1998, and 2014 Turing Awards attest, the new leaders were Ted Codd, Jim Gray, and Michael Stonebraker. With more than ten relational DBMSs built at the time of the debate and the three most significant relational DBMSs in the works, the database technology shift to relational databases was under way.

Mike: More Memorable than the Debate, and Even the Cheese

Apart from the neon orange cheese, SQL, and being awed by database cognoscenti, there was little memorable about SIGFIDET 1974, except meeting Mike Stonebraker. Mike Something became a colleague and friend for life. Although a stranger and the most junior academic at the strategy dinner (I don’t count), Mike was unforgettable, more so as time went on. Me: “Hey, Mike, remember that dinner before the Relational-CODASYL Debate?” Mike: “Sorry, I don’t remember.” Maybe it’s like a fan meeting Paul McCartney: Only one of the two remembers. For this chapter, I asked Dennis Tsichritzis and other database cognoscenti for memorable moments at this event, to a uniform response of “not really.” Don Chamberlin and Ray Boyce, the SQL inventors, were there [Chamberlin and Boyce 1974]. But most future relational cognoscenti had not even joined the database party. Bruce Lindsay and Jim Gray were at UC Berkeley and would move that year to the System R project at IBM. The instrumental Pat Selinger was at Harvard (Ph.D. 1975) and wouldn’t join System R until after her Ph.D. SIGFIDET 1974 was a milestone that marked the end of the old guard and the emergence of the relational era, with most of the relational cognoscenti still wet behind the ears on the relational model, and Mike Stonebraker, unwittingly, taking the lead.

To this day, Mike is succinct in the extreme, intense, visionary, and superbly confident. Yet at the debate, he was strangely quiet (not called upon) especially as he was in the 1% who understood Ted’s model and had implementation experience. Perhaps he was gaining his sea legs. He had been an assistant professor for about three years. Forty years later, at his Festschrift, Mike recalled those tenure-grinding years as the worst of his career due to the pressures of an academic life—teaching and tenure, in a new domain, developing one of the most significant database systems from scratch, while, as Don Haderle says in Chapter 35, having to learn “how to create and operate a business.” At SIGFIDET he was new to databases, having learned what a database was two years earlier when, while Mike was wondering what to do at UC Berkeley, Gene Wong had suggested that he read Codd’s paper [Codd 1970]. Mike’s first Ph.D. student, Jerry Held, had already implemented a DBMS. By May 1974, Mike had already impressed the relational cognoscenti, the then-future of databases. Today at conferences, people universally wait to hear Mike’s opinions. Or in his absence, as at VLDB 2017, Mike’s opinions tend to be quoted in every keynote speech. On issues critical to him, he speaks out with succinct observations and questions that get right to the heart of the matter. For example, he might ask, “What use-case and workload do you envisage?” Answer: Rhubarb, rhubarb, rhubarb. Mike replies: “Interesting. VoltDB is in that space but in seven years has never encountered a single customer asking for those features.”

A Decade Later: Friend or Foe?

At the First International Conference on Expert Database Systems, Kiawah Island, South Carolina [van de Riet 1986], I debated with Mike on the topic “Are Data Models Dead?” I do not recall the content nor the tone, which must have appeared confrontational because I do recall a look of utter surprise from Larry Kerschberg, the program committee chair, as Mike and I hugged off stage. Mike had arrived just before the debate, so we had not yet greeted each other. When it matters, Mike speaks his mind pithier than most. His directness and honesty may seem confrontational to some. I have never seen such an intent; rather, he is getting to the heart of the matter quickly. That enriches the discussion for some and can end it for others.

My first meeting with Mike over 40 years ago was memorable. There were others at the strategy dinner, but I do not recall them. Mike was quiet, calm, succinct, scary smart, and contrarian. He was a Turing laureate in the making. My impression was that he was the smartest man in the room. My impression, like data in Ingres, Postgres, and his many other DBMSs, persists.

1. Amazingly, approximately 10 RDBMSs were implemented or under way within three years of Ted’s landmark paper.

2. Holding back like that didn’t last long for Mike.

3. The database cognoscenti who were running the debate may not have foreseen that in 40 years the tall, new guy with the unanswered hand would receive the Turing Award for the very issues being debated.

4. DB2 was IBM’s #2 DBMS product after its #1 DBMS product, IMS.

5. Mike was already at the heart of the performance issue [Stonebraker 1974b] described so eloquently by Date and Codd [1975] in the same conference and missed by debate questioners. Mike has generalized this as the key requirement of any new data model and its data manager.

6. In World War II, Ted trained as a pilot in Canada with the British Royal Air Force. I am Canadian, and my mother, an Englishwoman, had been in the British Women’s Royal Air Force.

PART IX

SEMINAL WORKS OF MICHAEL STONEBRAKER AND HIS COLLABORATORS

OLTP Through the Looking Glass, and What We Found There

Stavros Harizopoulos (HP Labs), Daniel J. Abadi (Yale University), Samuel Madden (MIT), Michael Stonebraker (MIT)

Abstract

Online Transaction Processing (OLTP) databases include a suite of features—disk-resident B-trees and heap files, locking-based concurrency control, support for multi-threading—that were optimized for computer technology of the late 1970’s. Advances in modern processors, memories, and networks mean that today’s computers are vastly different from those of 30 years ago, such that many OLTP databases will now fit in main memory, and most OLTP transactions can be processed in milliseconds or less. Yet database architecture has changed little.

Based on this observation, we look at some interesting variants of conventional database systems that one might build that exploit recent hardware trends, and speculate on their performance through a detailed instruction-level breakdown of the major components involved in a transaction processing database system (Shore) running a subset of TPC-C. Rather than simply profiling Shore, we progressively modified it so that after every feature removal or optimization, we had a (faster) working system that fully ran our workload. Overall, we identify overheads and optimizations that explain a total difference of about a factor of 20x in raw performance. We also show that there is no single “high pole in the tent” in modern (memory resident) database systems, but that substantial time is spent in logging, latching, locking, B-tree, and buffer management operations.

Categories and Subject Descriptors

H.2.4 [Database Management]: Systems—transaction processing; concurrency.

General Terms

Measurement, Performance, Experimentation.

Keywords

Online Transaction Processing, OLTP, main memory transaction processing, DBMS architecture.

1  Introduction

Modern general purpose online transaction processing (OLTP) database systems include a standard suite of features: a collection of on-disk data structures for table storage, including heap files and B-trees, support for multiple concurrent queries via locking-based concurrency control, log-based recovery, and an efficient buffer manager. These features were developed to support transaction processing in the 1970’s and 1980’s, when an OLTP database was many times larger than the main memory, and when the computers that ran these databases cost hundreds of thousands to millions of dollars.

Today, the situation is quite different. First, modern processors are very fast, such that the computation time for many OLTP-style transactions is measured in microseconds. For a few thousand dollars, a system with gigabytes of main memory can be purchased. Furthermore, it is not uncommon for institutions to own networked clusters of many such workstations, with aggregate memory measured in hundreds of gigabytes—sufficient to keep many OLTP databases in RAM.

Second, the rise of the Internet, as well as the variety of data intensive applications in use in a number of domains, has led to a rising interest in database-like applications without the full suite of standard database features. Operating systems and networking conferences are now full of proposals for “database-like” storage systems with varying forms of consistency, reliability, concurrency, replication, and queryability [DG04, CDG+06, GBH+00, SMK+01].

This rising demand for database-like services, coupled with dramatic performance improvements and cost reduction in hardware, suggests a number of interesting alternative systems that one might build with a different set of features than those provided by standard OLTP engines.

1.1  Alternative DBMS Architectures

Obviously, optimizing OLTP systems for main memory is a good idea when a database fits in RAM. But a number of other database variants are possible; for example:

•  Logless databases. A log-free database system might either not need recovery, or might perform recovery from other sites in a cluster (as was proposed in systems like Harp [LGG+91], Harbor [LM06], and C-Store [SAB+05]).

•  Single threaded databases. Since multi-threading in OLTP databases was traditionally important for latency hiding in the face of slow disk writes, it is much less important in a memory resident system. A single-threaded implementation may be sufficient in some cases, particularly if it provides good performance. Though a way to take advantage of multiple processor cores on the same hardware is needed, recent advances in virtual machine technology provide a way to make these cores look like distinct processing nodes without imposing massive performance overheads [BDR97], which may make such designs feasible.

•  Transaction-less databases. Transactional support is not needed in many systems. In particular, in distributed Internet applications, eventual consistency is often favored over transactional consistency [Bre00, DHJ+07]. In other cases, lightweight forms of transactions, for example, where all reads are required to be done before any writes, may be acceptable [AMS+07, SMA+07].

In fact, there have been several proposals from inside the database community to build database systems with some or all of the above characteristics [WSA97, SMA+07]. An open question, however, is how well these different configurations would perform if they were actually built. This is the central question of this paper.

1.2  Measuring the Overheads of OLTP

To understand this question, we took a modern open source database system (Shore—see http://www.cs.wisc.edu/shore/) and benchmarked it on a subset of the TPC-C benchmark. Our initial implementation—running on a modern desktop machine—ran about 640 transactions per second (TPS). We then modified it by removing different features from the engine one at a time, producing new benchmarks each step of the way, until we were left with a tiny kernel of query processing code that could process 12700 TPS. This kernel is a single-threaded, lock-free, main memory database system without recovery. During this decomposition, we identified four major components whose removal substantially improved the throughput of the system:

Logging. Assembling log records and tracking down all changes in database structures slows performance. Logging may not be necessary if recoverability is not a requirement or if recoverability is provided through other means (e.g., other sites on the network).

Locking. Traditional two-phase locking poses a sizeable overhead since all accesses to database structures are governed by a separate entity, the Lock Manager.
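
The cost comes from routing every access through that separate entity. A minimal two-phase-locking sketch makes the bookkeeping visible; the `LockManager` class and its `"S"`/`"X"` modes are illustrative assumptions, not Shore's actual implementation:

```python
from threading import Lock

class LockManager:
    """Toy central lock manager: every record access consults a shared table.
    A real manager adds wait queues, deadlock detection, and lock upgrades."""
    def __init__(self):
        self._mutex = Lock()   # protects the lock table itself
        self._table = {}       # record id -> (mode, set of holding txns)

    def acquire(self, txn, rid, mode):
        # Growing phase of two-phase locking; mode is "S" (shared) or "X" (exclusive).
        with self._mutex:
            entry = self._table.get(rid)
            if entry is None:
                self._table[rid] = (mode, {txn})
                return True
            held_mode, holders = entry
            if txn in holders and held_mode == mode:
                return True                      # lock already held
            if mode == "S" and held_mode == "S":
                holders.add(txn)                 # readers share
                return True
            return False                         # conflict: caller blocks or aborts

    def release_all(self, txn):
        # Shrinking phase: drop everything at commit/abort.
        with self._mutex:
            for rid in list(self._table):
                mode, holders = self._table[rid]
                holders.discard(txn)
                if not holders:
                    del self._table[rid]

lm = LockManager()
lm.acquire("t1", "r1", "S")      # two readers share record r1
lm.acquire("t2", "r1", "S")
lm.acquire("t3", "r1", "X")      # writer conflicts until both release
```

Every one of these table lookups and mode checks is work that a single-threaded, transaction-less engine simply never does.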

Latching. In a multi-threaded database, many data structures have to be latched before they can be accessed. Removing this feature and going to a single-threaded approach has a noticeable performance impact.

Buffer management. A main memory database system does not need to access pages through a buffer pool, eliminating a level of indirection on every record access.
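
The level of indirection being eliminated can be sketched as follows; the `BufferPool` class and page size here are illustrative assumptions (a real pool also does eviction, dirty-page tracking, and latching):

```python
PAGE_SIZE = 4096

class BufferPool:
    """Disk-oriented path: a record address is (page id, offset), resolved
    through a frame table on every single access."""
    def __init__(self):
        self.frames = {}                 # page id -> in-memory frame

    def fetch(self, pid):
        frame = self.frames.get(pid)
        if frame is None:                # a miss here would trigger a disk read
            frame = bytearray(PAGE_SIZE)
            self.frames[pid] = frame
        return frame                     # caller pins it and reads at an offset

def read_buffered(pool, pid, offset, length):
    page = pool.fetch(pid)
    return bytes(page[offset:offset + length])

# Main-memory path: records addressed directly, no page/offset translation.
heap = {"r1": b"abc"}

def read_direct(rid):
    return heap[rid]
```

Both paths return the same bytes; the buffered one just pays a frame-table lookup, a possible fault, and an offset computation on every record touch.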

1.3  Results

Figure 1 shows how each of these modifications affected the bottom line performance (in terms of CPU instructions per TPC-C New Order transaction) of Shore. We can see that each of these subsystems by itself accounts for between about 10% and 35% of the total runtime (1.73 million instructions, represented by the total height of the figure). Here, “hand coded optimizations” represents a collection of optimizations we made to the code, which primarily improved the performance of the B-tree package. The actual instructions to process the query, labelled “useful work” (measured through a minimal implementation we built on top of a hand-coded main-memory B-tree package) is only about 1/60th of that. The white box below “buffer manager” represents our version of Shore after we had removed everything from it—Shore still runs the transactions, but it uses about 1/15th of the instructions of the original system, or about 4 times the number of instructions in the useful work. The additional overheads in our implementation are due to call-stack depth in Shore and the fact that we could not completely strip out all references to transactions and the buffer manager.
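
The ratios quoted in Sections 1.2 and 1.3 hang together; a quick arithmetic check using the figures from the text:

```python
# Throughput: Shore as shipped vs. the stripped-down kernel (Section 1.2).
baseline_tps = 640
kernel_tps = 12_700
speedup = kernel_tps / baseline_tps          # ~20, the "factor of 20x"

# Instruction counts (this section): 1.73M per New Order transaction.
total = 1_730_000
useful = total / 60       # "useful work" kernel is ~1/60th of the total
stripped = total / 15     # fully stripped Shore runs in ~1/15th of the instructions

print(round(speedup))              # -> 20
print(round(stripped / useful))    # -> 4: stripped Shore is ~4x the useful work
```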

Figure 1  Breakdown of instruction count for various DBMS components for the New Order transaction from TPC-C. The top of the bar-graph is the original Shore performance with a main memory resident database and no thread contention. The bottom dashed line is the useful work, measured by executing the transaction on a no-overhead kernel.

1.4  Contributions and Paper Organization

The major contributions of this paper are 1) to dissect where time goes inside a modern database system, 2) to carefully measure the performance of various stripped-down variants of a modern database system, and 3) to use these measurements to speculate on the performance of different data management systems—for example, systems without transactions or logs—that one could build.

The remainder of this paper is organized as follows. In Section 2 we discuss OLTP features that may soon become (or are already becoming) obsolete. In Section 3 we review the Shore DBMS, as it was the starting point of our exploration, and describe the decomposition we performed. Section 4 contains our experimentation with Shore. Then, in Section 5, we use our measurements to discuss implications on future OLTP engines and speculate on the performance of some hypothetical data management systems. We present additional related work in Section 6 and conclude in Section 7.

2  Trends in OLTP

As mentioned in the introduction, most popular relational RDBMSs trace their roots to systems developed in the 1970’s, and include features like disk-based indexing and heap files, log-based transactions, and locking-based concurrency control. However, 30 years have passed since these architectural decisions were made. At the present time, the computing world is quite different from when these traditional systems were designed; the purpose of this section is to explore the impact of these differences. We made a similar set of observations in [SMA+07].

2.1  Cluster Computing

Most current generation RDBMSs were originally written for shared memory multiprocessors in the 1970’s. Many vendors added support for shared disk architectures in the 1980’s. The last two decades have seen the advent of Gamma-style shared nothing databases [DGS+90] and the rise of clusters of commodity PCs for many large scale computing tasks. Any future database system must be designed from the ground up to run on such clusters.

2.2  Memory Resident Databases

Given the dramatic increase in RAM sizes over the past several decades, there is every reason to believe that many OLTP systems already fit or will soon fit into main memory, especially the aggregate main memory of a large cluster. This is largely because the sizes of most OLTP systems are not growing as dramatically as RAM capacity, as the number of customers, products, and other real-world entities they record information about does not scale with Moore’s law. Given this observation, it makes sense for database vendors to create systems that optimize for the common case of a memory-resident system. In such systems, optimized indices [RR99, RR00], as well as eschewing disk-optimized tuple formats and page layouts (or the lack thereof) [GS92], are important to consider.

2.3  Single Threading in OLTP Systems

All modern databases include extensive support for multi-threading, including a collection of transactional concurrency control protocols as well as extensive infiltration of their code with latching commands to support multiple threads accessing shared structures like buffer pools and index pages. The traditional motivations for multi-threading are to allow transaction processing to occur on behalf of one transaction while another waits for data to come from disk, and to prevent long-running transactions from keeping short transactions from making progress.

We claim that neither of these motivations is valid any more. First, if databases are memory resident, then there are never any disk waits. Furthermore, production transaction systems do not include any user waits—transactions are executed almost exclusively through stored procedures. Second, OLTP workloads are very simple. A typical transaction consists of a few index lookups and updates, which, in a memory resident system, can be completed in hundreds of microseconds. Moreover, with the bifurcation of the modern database industry into a transaction processing and a warehousing market, long running (analytical) queries are now serviced by warehouses.

One concern is that multi-threading is needed to support machines with multiple processors. We believe, however, that this can be addressed by treating one physical node with multiple processors as multiple nodes in a shared-nothing cluster, perhaps managed by a virtual machine monitor that dynamically allocates resources between these logical nodes [BDR97].

Another concern is that networks will become the new disks, introducing latency into distributed transactions and requiring the re-introduction of multithreading. This is certainly true in the general case, but for many transaction applications, it is possible to partition the workload to be “single-sited” [Hel07, SMA+07], such that all transactions can be run entirely on a single node in a cluster.
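To make “single-sited” concrete, the hypothetical sketch below (not part of Shore or any system cited here) routes each TPC-C-style transaction to a node by its home-warehouse ID; if every table is partitioned on warehouse ID, each routed transaction touches exactly one node of the shared-nothing cluster:

```python
# Hypothetical sketch: partition a TPC-C-style workload by home warehouse so
# each transaction runs entirely on a single node of a shared-nothing cluster.
NUM_NODES = 4

def node_for(warehouse_id: int) -> int:
    """If every table is partitioned on warehouse ID, routing a transaction
    by its home warehouse makes it single-sited."""
    return warehouse_id % NUM_NODES

def route(transactions):
    """Group (txn_id, home_warehouse) pairs by the node that owns them."""
    plan = {n: [] for n in range(NUM_NODES)}
    for txn_id, warehouse_id in transactions:
        plan[node_for(warehouse_id)].append(txn_id)
    return plan

workload = [("new_order_1", 3), ("payment_2", 7), ("new_order_3", 3)]
print(route(workload))  # {0: [], 1: [], 2: [], 3: ['new_order_1', 'payment_2', 'new_order_3']}
```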

Hence, certain classes of database applications will not need support for multithreading; in such systems, legacy locking and latching code becomes unnecessary overhead.

2.4  High Availability vs. Logging

Production transaction processing systems require 24x7 availability. For this reason, most systems use some form of high availability, essentially using two (or more) times the hardware to ensure that there is an available standby in the event of a failure.

Recent papers [LM06] have shown that, at least for warehouse systems, it is possible to exploit these available standbys to facilitate recovery. In particular, rather than using a REDO log, recovery can be accomplished by copying missing state from other database replicas. In our previous work we have claimed that this can be done for transaction systems as well [SMA+07]. If this is in fact the case, then the recovery code in legacy databases becomes also unnecessary overhead.

2.5  Transaction Variants

Although many OLTP systems clearly require transactional semantics, there have recently been proposals—particularly in the Internet domain—for data management systems with relaxed consistency. Typically, what is desired is some form of eventual consistency [Bre00, DHJ+07] in the belief that availability is more important than transactional semantics. Databases for such environments are likely to need little of the machinery developed for transactions (e.g., logs, locks, two-phase commit, etc.).

Even if one requires some form of strict consistency, many slightly relaxed models are possible. For example, the widespread adoption of snapshot isolation (which is non-transactional) suggests that many users are willing to trade transactional semantics for performance (in this case, due to the elimination of read locks).

And finally, recent research has shown that there are limited forms of transactions that require substantially less machinery than standard database transactions. For example, if all transactions are “two-phase”—that is, they perform all of their reads before any of their writes and are guaranteed not to abort after completing their reads—then UNDO logging is not necessary [AMS+07, SMA+07].
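The two-phase transaction pattern can be sketched in a few lines; this is a minimal illustration of the idea (not code from [AMS+07] or [SMA+07]): all reads, and any decision to abort, happen before the first write, so no UNDO machinery is ever needed.

```python
# Minimal sketch of a "two-phase" transaction: all reads (and the decision to
# abort) happen before any write, so no UNDO log is ever needed.
def run_two_phase(db, read_keys, compute_writes):
    # Phase 1: perform every read; abort here if anything is invalid.
    snapshot = {}
    for k in read_keys:
        if k not in db:
            return False          # abort: nothing has been written yet
        snapshot[k] = db[k]
    # Phase 2: writes are guaranteed not to abort, so apply them directly.
    for k, v in compute_writes(snapshot).items():
        db[k] = v
    return True

db = {"stock:1": 10, "stock:2": 5}
ok = run_two_phase(db, ["stock:1", "stock:2"],
                   lambda s: {k: v - 1 for k, v in s.items()})
print(ok, db)   # True {'stock:1': 9, 'stock:2': 4}
bad = run_two_phase(db, ["stock:99"], lambda s: {})
print(bad)      # False -- and the database is untouched
```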

2.6  Summary

As our references suggest, several research groups, including Amazon [DHJ+07], HP [AMS+07], NYU [WSA97], and MIT [SMA+07] have demonstrated interest in building systems that differ substantially from the classic OLTP design. In particular, the MIT H-Store [SMA+07] system demonstrates that removing all of the above features can yield a two-order-of-magnitude speedup in transaction throughput, suggesting that some of these database variants are likely to provide remarkable performance. Hence, it would seem to behoove the traditional database vendors to consider producing products with some of these features explicitly disabled. With the goal of helping these implementers understand the performance impact of different variants they may consider building, we proceed with our detailed performance study of Shore and the variants of it we created.

3  Shore

Shore (Scalable Heterogeneous Object Repository) was developed at the University of Wisconsin in the early 1990’s and was designed to be a typed, persistent object system borrowing from both file system and object-oriented database technologies [CDF+94]. It had a layered architecture that allowed users to choose the appropriate level of support for their application from several components. These layers (type system, unix compatibility, language heterogeneity) were provided on top of the Shore Storage Manager (SSM). The storage manager provided features that are found in all modern DBMS: full concurrency control and recovery (ACID transaction properties) with two-phase locking and write-ahead logging, along with a robust implementation of B-trees. Its basic design comes from ideas described in Gray’s and Reuter’s seminal book on transaction processing [GR93], with many algorithms implemented straight from the ARIES papers [MHL+92, Moh89, ML89].

Support for the project ended in the late 1990’s, but support for the Shore Storage Manager continued; as of 2007, SSM version 5.0 is available for Linux on Intel x86 processors. Throughout the paper we use “Shore” to refer to the Shore Storage Manager. Information and source code for Shore are available online.1 In the rest of this section we discuss the key components of Shore, its code structure, the characteristics of Shore that affect end-to-end performance, and our set of modifications along with their effect on the code line.

3.1  Shore Architecture

There are several features of Shore that we do not describe as they are not relevant to this paper. These include disk volume management (we pre-load the entire database in main memory), recovery (we do not examine application crashes), distributed transactions, and access methods other than B-trees (such as R-trees). The remaining features can be organized roughly into the components shown in Figure 2.

Shore is provided as a library; the user code (in our case, the implementation of the TPC-C benchmark) is linked against the library and must use the threads library that Shore also uses. Each transaction runs inside a Shore thread, accessing both local user-space variables and Shore-provided data structures and methods. The methods relevant to OLTP are those needed to create and populate a database file, load it into the buffer pool, begin, commit, or abort a transaction, and perform record-level operations such as fetch, update, create, and delete, along with the associated operations on primary and secondary B-tree indexes.

Inside the transaction body (enclosed by begin and commit statements) the application programmer uses Shore’s methods to access the storage structures: the file and indexes, along with a directory to find them. All the storage structures use slotted pages to store information. Shore’s methods run under the transaction manager which closely interacts with all other components. Accessing the storage structures involves calls to the Log Manager, the Lock Manager, and the Buffer Pool Manager. These invocations always happen through a concurrency control layer, which oversees shared and mutually exclusive accesses to the various resources. This is not a separate module; rather, throughout the code, all accesses to shared structures happen by acquiring a latch. Latches are similar to database locks (in that they can be shared or exclusive), but they are lightweight and come with no deadlock detection mechanisms. The application programmers need to ensure that latching will not lead to deadlock.
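The lock/latch distinction above can be illustrated with a toy latch: like a lock it comes in shared and exclusive modes, but it carries no owner bookkeeping, no queueing policy, and no deadlock detection (the names and structure below are illustrative, not Shore's actual latch implementation):

```python
# Toy illustration of a latch: shared/exclusive like a database lock, but
# lightweight -- no owner tracking, no queueing, no deadlock detection.
class Latch:
    def __init__(self):
        self.readers = 0
        self.exclusive = False

    def try_acquire(self, shared: bool) -> bool:
        if self.exclusive:
            return False              # someone holds it exclusively
        if shared:
            self.readers += 1         # shared latches are compatible
            return True
        if self.readers == 0:
            self.exclusive = True     # exclusive only when no readers
            return True
        return False

    def release(self, shared: bool):
        if shared:
            self.readers -= 1
        else:
            self.exclusive = False

l = Latch()
print(l.try_acquire(shared=True))   # True: shared access granted
print(l.try_acquire(shared=True))   # True: shared latches coexist
print(l.try_acquire(shared=False))  # False: exclusive blocked by readers
```

Because there is no deadlock detection, it is the programmer's job (as the text notes) to acquire latches in an order that cannot deadlock.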

Figure 2  Basic components in Shore (see text for detailed description).

Next, we discuss the thread architecture and give more details on locking, logging, and the buffer pool management.

Thread support. Shore provides its own user-level, non-preemptive thread package that was derived from NewThreads (originally developed at the University of Washington), providing a portable OS interface API. The choice of the thread package had implications for the code design and behavior of Shore. Since threads are user-level, the application runs as a single process, multiplexing all Shore threads. Shore avoids blocking for I/O by spawning separate processes responsible for I/O devices (all processes communicate through shared memory). However, applications cannot take direct advantage of multicore (or SMP) systems, unless they are built as part of a distributed application; that, however, would add unnecessary overhead for multicore CPUs, when simple, non-user-level threading would be sufficient.

Consequently, for the results reported throughout this paper, we use single-threaded operation. A system that uses multithreaded operation would consume a larger number of instructions and CPU cycles per transaction (since thread code would need to be executed in addition to transactional code). Since the primary goal of the paper is to look at the cost in CPU instructions of various database system components, the lack of a full multi-threading implementation in Shore only affects our results in that we begin at a lower starting point in total CPU instructions when we begin removing system components.

Locking and logging. Shore implements standard two-phase locking, with transactions having standard ACID properties. It supports hierarchical locking with the lock manager escalating up the hierarchy by default (record, page, store, volume). Each transaction keeps a list of the locks it holds, so that the locks can be logged when the transaction enters the prepared state and released at the end of the transaction. Shore also implements write ahead logging (WAL), which requires a close interaction between the log manager and the buffer manager. Before a page can be flushed from the buffer pool, the corresponding log record might have to be flushed. This also requires a close interaction between the transaction manager and the log manager. All three managers understand log sequence numbers (LSNs), which serve to identify and locate log records in the log, timestamp pages, identify the last update performed by a transaction, and find the last log record written by a transaction. Each page bears the LSN of the last update that affected that page. A page cannot be written to disk until the log record with that page’s LSN has been written to stable storage.
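The WAL rule in the last two sentences can be sketched directly; this is a simplified illustration under assumed names (a single monotone LSN counter and an in-memory log), not Shore's actual log manager: before a page reaches disk, the log must be forced up to that page's LSN.

```python
# Sketch of the write-ahead-logging rule described above: a dirty page may be
# written to disk only after the log is flushed up to that page's LSN.
class WalSketch:
    def __init__(self):
        self.log = []             # in-memory log records: (lsn, payload)
        self.flushed_lsn = 0      # highest LSN durably on "disk"

    def append(self, payload) -> int:
        lsn = len(self.log) + 1   # LSNs identify and order log records
        self.log.append((lsn, payload))
        return lsn

    def flush_log_upto(self, lsn: int):
        self.flushed_lsn = max(self.flushed_lsn, lsn)

    def write_page(self, page_lsn: int) -> str:
        # Enforce WAL: force the log first if the page is ahead of the disk log.
        if page_lsn > self.flushed_lsn:
            self.flush_log_upto(page_lsn)
        return "page written"

wal = WalSketch()
lsn = wal.append("update record 42")
wal.write_page(page_lsn=lsn)
print(wal.flushed_lsn)  # 1: the log was forced before the page reached disk
```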

Buffer Manager. The buffer manager is the means by which all other modules (except the log manager) read and write pages. A page is read by issuing a fix method call to the buffer manager. For a database that fits in main memory, the page is always found in the buffer pool (in the non-main memory case, if the requested page is not in the buffer pool, the thread gives up the CPU and waits for the process responsible for I/O to place the page in the buffer pool). The fix method updates the mapping between page IDs and buffer frames and usage statistics. To ensure consistency there is a latch to control access to the fix method. Reading a record (once a record ID has been found through an index lookup) involves

1.  locking the record (and page, per hierarchical locking),

2.  fixing the page in the buffer pool, and

3.  computing the offset within the page of the record’s tag.

Reading a record is performed by issuing a pin / unpin method call. Updates to records are accomplished by copying out part or all of the record from the buffer pool to the user’s address space, performing the update there, and handing the new data to the storage manager.
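The three-step read path above can be sketched as follows; names (`BufferPool.fix`, the slot-array layout) are illustrative stand-ins for Shore's actual structures, and the main-memory case is assumed so `fix` never blocks on I/O:

```python
# Hypothetical sketch of the record-read path described above: lock the record,
# fix its page in the buffer pool, then locate the record via the page's slots.
class BufferPool:
    def __init__(self, pages):
        self.pages = pages            # page_id -> {"slots": [...], "data": bytes}

    def fix(self, page_id):
        # Main-memory case: the page is always resident, so fix() is a lookup
        # (plus, in the real system, latching and usage-statistics updates).
        return self.pages[page_id]

def read_record(pool, held_locks, rid):
    page_id, slot = rid
    held_locks.add(rid)               # 1. lock the record (hierarchical locking)
    page = pool.fix(page_id)          # 2. fix the page in the buffer pool
    offset = page["slots"][slot]      # 3. offset of the record within the page
    return page["data"][offset:offset + 5]

pool = BufferPool({7: {"slots": [0, 5], "data": b"helloworld"}})
print(read_record(pool, set(), (7, 1)))  # b'world'
```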

Table 1  Possible set of optimizations for OLTP.

    OLTP Properties and New Platforms    DBMS Modification
    -----------------------------------  ----------------------------------------------
    logless architectures                remove logging
    partitioning, commutativity          remove locking (when applicable)
    one transaction at a time            single thread, remove locking, remove latching
    main memory resident                 remove buffer manager, directory
    transaction-less databases           avoid transaction bookkeeping

More details on the architecture of Shore can be found at the project’s web site. Some additional mechanisms and features are also described in the following paragraphs, where we discuss our own modifications to Shore.

3.2  Removing Shore Components

Table 1 summarizes the properties and characteristics of modern OLTP systems (left column) that allow us to strip certain functionality from a DBMS (right column). We use these optimizations as a guideline for modifying Shore. Due to the tight integration of all managers in Shore, it was not possible to cleanly separate all components so that they could be removed in an arbitrary order. The next best thing was to remove features in an order dictated by the structure of the code, allowing for flexibility whenever possible. That order was the following:

1.  Removing logging.

2.  Removing locking OR latching.

3.  Removing latching OR locking.

4.  Removing code related to the buffer manager.

In addition, we found that the following optimizations could be performed at any point:

•  Streamline and hardcode the B-tree key evaluation logic, as is presently done in most commercial systems.

•  Accelerate directory lookups.

•  Increase page size to avoid frequent allocations (subsumed by step 4 above).

• Remove transactional sessions (begin, commit, various checks).

Our approach to implementing the above-mentioned actions is described next. In general, to remove a certain component from the system, we either add a few if-statements to avoid executing code belonging to that component, or, if we find that if-statements add a measurable overhead, we rewrite entire methods to avoid invoking that component altogether.

Remove logging. Removing logging consists of three steps. The first is to avoid generating I/O requests along with the time associated to perform these requests (later, in Figure 7, we label this modification “disk log”). We achieve this by allowing group commit and then increasing the log buffer size so that it is not flushed to disk during our experiments. Then, we comment out all functions that are used to prepare and write log records (labeled “main log” in Figure 7). The last step was to add if-statements throughout the code to avoid processing Log Sequence Numbers (labeled “LSN” in Figure 7).
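The if-statement approach used throughout these modifications can be sketched as a flag-guarded call into the log manager (a simplified illustration, not Shore's code): with the flag off, every logging call short-circuits before doing any work, which is also how LSN processing is skipped.

```python
# Sketch of the if-statement approach to disabling a component: calls into the
# log manager are guarded by a flag, so "removing logging" short-circuits them.
LOGGING_ENABLED = False
log_buffer = []

def log_update(record):
    if not LOGGING_ENABLED:       # component disabled: fall through immediately
        return None
    log_buffer.append(record)
    return len(log_buffer)        # LSN of the new record

print(log_update("update t1"), log_buffer)  # None [] -- no log work performed
```

When a guard like this sat on a hot enough path to add measurable overhead, the text notes that the method was rewritten to drop the call entirely instead.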

Remove locking (interchangeable with removing latching). In our experiments we found that we could safely interchange the order of removing locking and latching (once logging was already removed). Since latching is also performed inside locking, removing one also reduces the overhead of the other. To remove locking we first changed all Lock Manager methods to return immediately, as if the lock request was successful and all checks for locks were satisfied. Then, we modified methods related to pinning records, looking them up in a directory, and accessing them through a B-tree index. In each case, we eliminated code paths related to ungranted lock requests.

Remove latching (interchangeable with removing locking). Removing latching was similar to removing locking; we first changed all mutex requests to be immediately satisfied. We then added if-statements throughout the code to avoid requests for latches. We had to replace B-tree methods with ones that did not use latches, since adding if-statements would have increased overhead significantly because of the tight integration of latch code in the B-tree methods.

Remove buffer manager calls. The most widespread modification we performed was to remove the buffer manager methods, once we knew that logging, locking, and latching were already disabled. To create new records, we abandoned Shore’s page allocation mechanism and instead used the standard malloc library. We call malloc for each new record (records no longer reside in pages) and use pointers for future accesses. Memory allocation can potentially be done more efficiently, especially when one knows in advance the sizes of the allocated objects. However, further optimization of main memory allocation is an incremental improvement relative to the overheads we are studying, and is left for future work. We were not able to completely remove the page interface to buffer frames, since its removal would invalidate most of the remaining Shore code. Instead, we accelerated the mappings between pages and buffer frames, reducing the overhead to a minimum. Similarly, pinning and updating a record will still go through a buffer manager layer, albeit a very thin one (we label this set of modifications “page access” in Figure 7).
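The record-allocation change above amounts to the following sketch (illustrative only; Python objects stand in for malloc'd blocks): each new record gets its own heap allocation and is reached by reference thereafter, bypassing page IDs and slot offsets.

```python
# Sketch of replacing page-based record allocation with per-record allocation:
# records no longer live in pages; a reference replaces (page_id, slot).
records = []   # stand-in for the heap of malloc'd record blocks

def create_record(payload):
    rec = {"payload": payload}     # one "malloc" per new record
    records.append(rec)
    return rec                     # the "pointer" used for all future accesses

r = create_record("customer 42")
r["payload"] = "customer 42 (updated)"   # update in place via the reference
print(records[0]["payload"])             # customer 42 (updated)
```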

Miscellaneous optimizations. There were four optimizations we made that can be invoked at any point during the process of removing the above-mentioned components. These were the following. (1) Accelerating the B-tree code by hand-coding node searches to optimize for the common case that keys are uncompressed integers (labeled “Btree keys” in Figures 5-8). (2) Accelerating directory lookups by using a single cache for all transactions (labeled “dir lookup” in Figure 7). (3) Increasing the page size from the default size of 8KB to 32KB, the maximum allowable in Shore (labeled “small page” in Figure 7). Larger pages, although not suitable for disk-based OLTP, can help in a main-memory resident database by reducing the number of levels in a B-tree (due to the larger node size), and result in less frequent page allocations for newly created records. An alternative would be to decrease the size of a B-tree node to the size of a cache line as proposed in [RR99], but this would have required removing the association between a B-tree node and a Shore page, or reducing a Shore page below 1KB (which Shore does not allow). (4) Removing the overhead of setting up and terminating a session for each transaction, along with the associated monitoring of running transactions, by consolidating transactions into a single session (labeled “Xactions” in Figure 7).
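Optimization (1) can be illustrated as follows (a sketch of the idea, not Shore's B-tree code): the generic path makes an indirect comparison call per key, while the specialized path for plain integer keys searches a node directly.

```python
# Sketch of the "Btree keys" optimization: specialize node search for the
# common case of uncompressed integer keys instead of dispatching through a
# generic key-comparison routine.
import bisect

def search_generic(keys, key, compare):
    # Generic path: one indirect comparison call per key examined.
    for i, k in enumerate(keys):
        if compare(key, k) <= 0:
            return i
    return len(keys)

def search_int(keys, key):
    # Hand-coded path for plain integers: direct binary search, no callbacks.
    return bisect.bisect_left(keys, key)

node = [10, 20, 30, 40]
assert search_generic(node, 30, lambda a, b: a - b) == search_int(node, 30) == 2
print(search_int(node, 25))  # 2: index of the first separator >= 25
```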

Our full set of changes/optimizations to Shore, along with the benchmark suite and documentation on how to run the experiments are available online.2 Next, we move to the performance section of the paper.

4  Performance Study

The section is organized as follows. First we describe our variant of the TPC-C benchmark that we used (Section 4.1). In Section 4.2 we provide details of the hardware platform, the experimental setup, and the tools we used for collecting the performance numbers. Section 4.3 presents a series of results, detailing Shore performance as we progressively apply optimizations and remove components.

Figure 3  TPC-C Schema

4.1  OLTP Workload

Our benchmark is derived from TPC-C [TPCC], which models a wholesale parts supplier operating out of a number of warehouses and their associated sales districts. TPC-C is designed to represent any industry that must manage, sell, or distribute a product or service. It is designed to scale as the supplier expands and new warehouses are created. The scaling requirement is that each warehouse must supply 10 sales districts, and each district must serve 3000 customers. The database schema along with the scaling requirements (as a function of the number of warehouses W) is shown in Figure 3. The database size for one warehouse is approximately 100 MB (we experiment with five warehouses for a total size of 500MB).

TPC-C involves a mix of five concurrent transactions of different types and complexity. These transactions include entering orders (the New Order transaction), recording payments (Payment), delivering orders, checking the status of orders, and monitoring the level of stock at the warehouses. TPC-C also specifies that about 90% of the time the first two transactions are executed. For the purposes of the paper, and for better understanding the effect of our interventions, we experimented with a mix of only the first two transactions. Their code structure (calls to Shore) is shown in Figure 4. We made the following small changes to the original specifications, to achieve repeatability in the experiments:

Figure 4  Calls to Shore’s methods for New Order and Payment transactions.

New Order. Each New Order transaction places an order for 5-15 items, with 90% of all orders supplied in full by stocks from the customer’s “home” warehouse (10% need to access stock belonging to a remote warehouse), and with 1% of the provided items being invalid (not found in the B-tree). To avoid variation in the results we set the number of items to 10, and always serve orders from a local warehouse. These two changes do not affect the throughput. The code in Figure 4 shows the two-phase optimization mentioned in Section 2.5, which allows us to avoid aborting a transaction mid-way; we read all items at the beginning, and if we find an invalid one we abort without having to undo any changes in the database.

Payment. This is a lightweight transaction; it updates the customer’s balance and warehouse/district sales fields, and generates a history record. Again, there is a choice of home and remote warehouse which we always set to the home one. Another randomly set input is whether a customer is looked up by name or ID, and we always use ID.

4.2  Setup and Measurement Methodology

All experiments are performed on a single-core Pentium 4 3.2GHz, with 1MB L2 cache, hyperthreading disabled, and 1GB RAM, running Linux 2.6. We compiled with gcc version 3.4.4 and O2 optimizations. We use the standard Linux utility iostat to monitor disk activity and to verify that no disk traffic is generated in the memory-resident experiments. In all experiments we pre-load the entire database into main memory. Then we run a large number of transactions (40,000). Throughput is measured directly by dividing the number of completed transactions by the wall-clock time.
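
The throughput measurement itself is just completed transactions over wall-clock time. A generic harness, as a sketch:

```python
import time

# Throughput = completed transactions / wall-clock seconds.
def measure_throughput(run_txn, n_txns: int) -> float:
    start = time.perf_counter()
    for _ in range(n_txns):
        run_txn()                  # execute one transaction
    elapsed = time.perf_counter() - start
    return n_txns / elapsed        # transactions per second

# A dummy "transaction" standing in for the real benchmark work:
rate = measure_throughput(lambda: sum(range(100)), 1000)
```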

For detailed instruction and cycle counts we instrument the benchmark application code with calls to the PAPI library [MBD+99] (http://icl.cs.utk.edu/papi/), which provides access to the CPU performance counters. Since we make a call to PAPI after every call to Shore, we have to compensate for the cost of the PAPI calls when reporting the final numbers. These calls had an instruction count of 535-537 and took between 1350 and 1500 cycles on our machine. We measure each call to Shore over all 40,000 transactions and report the average numbers.
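
Compensating for the instrumentation means subtracting its fixed per-call cost from every raw counter reading before averaging. A sketch using the instruction overhead quoted above (the constant is machine-specific):

```python
# Remove the fixed cost of the measurement call itself from each raw
# per-call instruction count, then average over all calls.
PAPI_CALL_INSTRUCTIONS = 536   # measured overhead: 535-537 instructions

def compensated_instructions(raw_count: int) -> int:
    return raw_count - PAPI_CALL_INSTRUCTIONS

def average_per_call(raw_counts: list) -> float:
    adjusted = [compensated_instructions(c) for c in raw_counts]
    return sum(adjusted) / len(adjusted)

print(average_per_call([2536, 3536]))  # 2500.0
```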

Most of the graphs reported in the paper are based on CPU instruction counts (as measured through the CPU performance counters) and not wall clock time. The reason is that instruction counts are representative of the total run-time code path length, and they are deterministic. Equal instruction counts among different components can of course result in different wall clock execution times (CPU cycles), because of different microarchitectural behavior (cache misses, TLB misses, etc.). In Section 4.3.4 we compare instruction counts to CPU cycles, illustrating the components where there is high micro-architectural efficiency that can be attributed to issues like few L2 cache misses and good instruction-level parallelism.

Cycle count, however, is susceptible to various parameters, ranging from CPU characteristics, such as cache size/associativity, branch predictors, TLB operation, to run-time variables such as concurrent processes. Therefore it should be treated as indicative of relative time breakdown. We do not expand on the issue of CPU cache performance in this paper, as our focus is to identify the set of DBMS components to remove that can produce up to two orders of magnitude better performance for certain classes of OLTP workloads. More information on the micro-architectural behavior of database workloads can be found elsewhere [Ail04].

Next, we begin the presentation of our results.

4.3 Experimental Results

In all experiments, our baseline Shore platform is a memory-resident database that is never flushed to disk (the only disk I/O that might be performed is from the Log Manager). There is only a single thread executing one transaction at a time. Masking I/O (in the case of disk-based logging) is not a concern as it only adds to overall response time and not to the instructions or cycles that the transaction has actually run.

We placed 11 different switches in Shore to allow us to remove functionality (or perform optimizations), which, during the presentation of the results, we organize into six components. For a list of the 11 switches (and the corresponding components) and the order we apply them, see Figure 7. These switches were described in more detail in Section 3.2 above. The last switch is to bypass Shore completely and run our own, minimal-overhead kernel, which we call “optimal” in our results. This kernel is basically a memory-resident, hand-built B-tree package with no additional transaction or query processing functionality.
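
To give a concrete feel for what such a minimal-overhead kernel does, the sketch below provides an in-memory ordered index with only insert and lookup, and no logging, locking, latching, or buffer pool. A sorted array stands in for the hand-built B-tree; this is not a reconstruction of the authors' kernel.

```python
import bisect

# A memory-resident ordered index with no transactional machinery:
# the only operations are insert (with in-place update) and lookup.
class OrderedIndex:
    def __init__(self):
        self._keys = []   # kept sorted; stand-in for B-tree keys
        self._vals = []

    def insert(self, key, value):
        i = bisect.bisect_left(self._keys, key)
        if i < len(self._keys) and self._keys[i] == key:
            self._vals[i] = value          # key exists: update in place
        else:
            self._keys.insert(i, key)
            self._vals.insert(i, value)

    def lookup(self, key):
        i = bisect.bisect_left(self._keys, key)
        if i < len(self._keys) and self._keys[i] == key:
            return self._vals[i]
        return None                        # key not present

idx = OrderedIndex()
idx.insert(10, "a"); idx.insert(5, "b"); idx.insert(10, "c")
```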

4.3.1 Effect on Throughput

After all of these deletions and optimizations, Shore is left with a code residue, which is all CPU cycles since there is no I/O whatsoever; specifically, an average of about 80 microseconds per transaction (for a 50-50 mix of New Order and Payment transactions), or about 12,700 transactions per second.

In comparison, the useful work in our optimal system was about 22 microseconds per transaction, or about 46,500 transactions per second. The main causes of this difference are a deeper call stack in our kernel, and our inability to remove some of the transaction setup and buffer pool calls without breaking Shore. As a point of reference, “out of the box” Shore, with logging enabled but with the database cached in main memory, provides about 640 transactions per second (1.6 milliseconds per transaction), whereas Shore running in main memory, but without log flushing, provides about 1,700 transactions per second, or about 588 microseconds per transaction. Hence, our modifications yield a factor of 20 improvement in overall throughput.
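
The figures quoted above are mutually consistent, up to rounding of the per-transaction times. A quick check:

```python
# Sanity-check the quoted throughput numbers (rounded inputs, so the
# results land slightly off the text's 12,700 and 46,500 figures).
def tps(us_per_txn: float) -> float:
    return 1_000_000 / us_per_txn

baseline = 640           # out-of-the-box Shore, database cached in memory
stripped = tps(80)       # ~80 us residual code path  -> ~12,500 txn/s
optimal = tps(22)        # ~22 us of useful work      -> ~45,500 txn/s
speedup = stripped / baseline   # ~20x overall improvement
print(round(stripped), round(optimal), round(speedup, 1))
```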

Given these basic throughput measurements, we now give detailed instruction breakdowns for the two transactions of our benchmark. Recall that the instruction and cycle breakdowns in the following sections do not include any impact of disk operations, whereas the throughput numbers for baseline Shore do include some log write operations.

Figure 5  Detailed instruction count breakdown for Payment transaction.

4.3.2  Payment

Figure 5 (left side) shows the reductions in the instruction count of the Payment transaction as we optimized B-tree key evaluations and removed logging, locking, latching, and buffer manager functionality. The right part of the figure shows, for each feature removal we perform, its effect on the number of instructions spent in various portions of the transaction’s execution. For the Payment transaction, these portions include a begin call, three B-tree lookups followed by three pin/unpin operations, followed by three updates (through the B-tree), one record creation and a commit call. The height of each bar is always the total number of instructions executed. The right-most bar is the performance of our minimal-overhead kernel.

Our B-tree key evaluation optimizations are reportedly standard practice in high-performance DBMS architectures, so we perform them first because any system should be able to do this. Removing logging affects mainly commits and updates, as those are the portions of the code that write log records, and to a lesser degree B-tree and directory lookups. These modifications remove about 18% of the total instruction count.

Locking accounts for the second-largest share of instructions, about 25% of the total. Removing it affects all of the code, but is especially important in the pin/unpin operations, the lookups, and the commits; this was expected, as these are the operations that must acquire or release locks (the transaction already holds locks on the updated records by the time the updates are performed).

Latching accounts for about 13% of the instructions, and is primarily important in the create record and B-tree lookup portions of the transaction. This is because the buffer pool (used in create) and B-trees are the primary shared data structures that must be protected with latches.

Finally, our buffer manager modifications account for about 30% of the total instruction count. Recall that with this set of modifications, new records are allocated directly with malloc, and page lookups no longer have to go through the buffer pool in most cases. This makes record allocation essentially free, and substantially improves the performance of other components that perform frequent lookups, like B-tree lookup and update.

At this point, the remaining kernel requires about 5% (for a 20x performance gain!) of the total initial instruction count, and is about 6 times the total instructions of our “optimal” system. This analysis leads to two observations: first, all six of the major components are significant, each accounting for 18 thousand or more instructions of the initial 180 thousand. Second, until all of our optimizations are applied, the reduction in instruction count is not dramatic: before our last step of removing the buffer manager, the remaining components used about a factor of three fewer instructions than the baseline system (versus a factor of 20 when the buffer manager is removed).
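
The percentages above can be checked against the ~180-thousand-instruction starting point (shares as quoted in the text; approximate):

```python
# Approximate shares of Payment's initial instruction count, per the text.
INITIAL = 180_000
removed = {
    "btree_key_opts_and_logging": 0.18,
    "locking": 0.25,
    "latching": 0.13,
    "buffer_manager": 0.30,
}
residue = 0.05   # what remains after all removals (a ~20x reduction)

total_removed = sum(removed.values())
kernel_instructions = INITIAL * residue
print(round(total_removed, 2), round(kernel_instructions))  # 0.86 9000
```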

4.3.3  New Order

A similar breakdown of the instruction count in the New Order transaction is shown in Figure 6; Figure 7 shows a detailed accounting of all 11 modifications and optimizations we performed. This transaction uses about 10 times as many instructions as the Payment transaction, requiring 13 B-tree inserts, 12 record creation operations, 11 updates, 23 pin/unpin operations, and 23 B-tree lookups. The main differences in the allocation of instructions to major optimizations between New Order and Payment are in B-tree key code, logging, and locking. Since New Order adds B-tree insertions in the mix of operations, there is more relative benefit to be had by optimizing the key evaluation code (about 16%). Logging and locking now only account for about 12% and 16% of the total instructions; this is largely because the total fraction of time spent in operations where logging and locking perform a lot of work is much smaller in this case.

The buffer manager optimizations still represent the most significant win here, again because we are able to bypass the high overhead of record creation. Looking at the detailed breakdown in Figure 7 for the buffer manager optimization reveals something surprising: changing from 8K to 32K pages (labelled “small page”) provides almost a 14% reduction in the total instruction count. This simple optimization—which serves to reduce the frequency of page allocations and decrease B-tree depth—offers a sizeable gain.
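
One way to see the gain: fanout scales with page size, so larger pages can shave a level off the tree while also reducing how often pages are allocated. A back-of-the-envelope estimate (the entry size and key count below are illustrative, not from the paper):

```python
import math

# Estimated B-tree depth for n keys, given page size and entry size.
def btree_depth(n_keys: int, page_bytes: int, entry_bytes: int = 32) -> int:
    fanout = page_bytes // entry_bytes
    # Levels needed so that fanout ** depth >= n_keys.
    return max(1, math.ceil(math.log(n_keys, fanout)))

# Quadrupling the page size (8 KB -> 32 KB) quadruples the fanout and,
# for this key count, removes one level from the tree.
d8 = btree_depth(100_000_000, 8 * 1024)    # depth 4
d32 = btree_depth(100_000_000, 32 * 1024)  # depth 3
```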

Figure 6  Detailed instruction count breakdown for New Order transaction.

Figure 7  Expanding breakdown for New Order (see Section 3.2 for the labels on the left column).

Figure 8  Instructions (left) vs. Cycles (right) for New Order.

4.3.4  Instructions vs. Cycles

Having looked at the detailed breakdown of instruction counts in the Payment and New Order transactions, we now compare the fraction of time (cycles) spent in each phase of the New Order transaction to the fraction of instructions used in each phase. The results are shown in Figure 8. As we noted earlier, we do not expect these two fractions to be identical for a given phase, because cache misses and pipeline stalls (typically due to branches) can cause some instructions to take more cycles than others. For example, B-tree optimizations reduce cycles less than they reduce instructions, because the Shore B-tree code overhead we remove is mainly offset calculations with few cache misses. Conversely, our residual “kernel” uses a larger fraction of cycles than it does instructions, because it is branch-heavy, consisting mostly of function calls. Similarly, logging uses significantly more cycles because it touches a lot of memory creating and writing log records (disk I/O time is not included in either graph). Finally, locking and the buffer manager consume about the same percentage of cycles as they do instructions.

5  Implications for Future OLTP Engines

Given the performance results in the previous section, we revisit our discussion of future OLTP designs from Section 2. Before going into the detailed implications of our results for the design of various database subsystems, we make two high level observations from our numbers:

•  First, stripping out any single component of the system yields a relatively small overall performance benefit. For example, our main memory optimizations improved the performance of Shore by about 30%, which is significant but unlikely to motivate the major database vendors to re-engineer their systems. Similar gains would be obtained by eliminating just latching, or by switching to a single-threaded, one-transaction-at-a-time approach.

•  The most significant gains are to be had when multiple optimizations are applied. A fully stripped down system provides a factor of twenty or more performance gain over out-of-the-box Shore, which is truly significant. Note that such a system can still provide transactional semantics, if only one transaction is run at a time, all transactions are two phase, and recovery is implemented by copying state from other nodes in the network. Such a system is very, very different from what any of the vendors currently offers, however.

5.1 Concurrency Control

Our experiments showed a significant contribution (about 19% of cycles) of dynamic locking to total overhead. This suggests that there is a large gain to be had by identifying scenarios, such as application commutativity, or transaction-at-a-time processing, that allow concurrency control to be turned off. However, there are many DBMS applications which are not sufficiently well-behaved or where running only one transaction at a time per site will not work. In such cases, there is an interesting question as to what concurrency control protocol is best. Twenty years ago, various researchers [KR81, ACL87] performed exhaustive simulations that clearly showed the superiority of dynamic locking relative to other concurrency control techniques. However, this work assumed a disk-based load with disk stalls, which obviously impacts the results significantly. It would be highly desirable to redo these sorts of simulation studies with a main memory workload. We strongly suspect that some sort of optimistic concurrency control would be the prevailing option.
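
As an illustration of the optimistic alternative, the sketch below implements commit-time validation over per-key versions: transactions read freely and abort at commit if anything they read has since been overwritten. This is a generic OCC sketch, not a scheme from the paper:

```python
# Backward-validation optimistic concurrency control: at commit, check
# that every key the transaction read is still at the version it saw.
class OCCStore:
    def __init__(self):
        self.data = {}       # key -> value
        self.version = {}    # key -> commit version counter

    def read(self, txn, key):
        txn["reads"][key] = self.version.get(key, 0)
        return self.data.get(key)

    def commit(self, txn):
        for key, seen in txn["reads"].items():
            if self.version.get(key, 0) != seen:
                return False              # conflict: abort and retry
        for key, value in txn["writes"].items():
            self.data[key] = value
            self.version[key] = self.version.get(key, 0) + 1
        return True

def begin():
    return {"reads": {}, "writes": {}}

store = OCCStore()
t1, t2 = begin(), begin()
store.read(t1, "x"); t1["writes"]["x"] = 1
store.read(t2, "x"); t2["writes"]["x"] = 2
ok1 = store.commit(t1)    # first committer wins
ok2 = store.commit(t2)    # saw an older version of "x": aborts
```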

5.2  Multi-core Support

Given the increasing prevalence of many-core computers, an interesting question is how future OLTP engines should deal with multiple cores. One option is to run multiple transactions concurrently on separate cores within a single site (as it is done today); of course, such parallelism requires latching and implies a number of resource allocation issues. Our experiments show that although the performance overhead of latching is not particularly high (10% of cycles in the dominant transaction, New Order), it still remains an obstacle in achieving significant performance improvements in OLTP. As technologies (such as transactional memory [HM93]) for efficiently running highly concurrent programs on multicore machines mature and find their way into products, it will be very interesting to revisit new implementations for latching and reassess the overhead of multithreading in OLTP.

A second option is to use virtualization, either at the operating system or DBMS level, to make it appear that each core is a single-threaded machine. It is unclear what the performance implications of that approach would be, warranting a careful study of such virtualization. A third option, complementary to the other two, is to attempt to exploit intra-query parallelism, which may be feasible even if the system only runs one transaction at a time. However, the amount of intra-query parallelism available in a typical OLTP transaction is likely to be limited.

5.3  Replication Management

The traditional database wisdom is to support replication through a log-shipping based active-passive scheme; namely, every object has an “active” primary copy, to which all updates are first directed. The log of changes is then spooled over the network to one or more “passive” backup sites. Recovery logic rolls the remote database forward from the log. This scheme has several disadvantages. First, unless a form of two-phase commit is used, the remote copies are not transactionally consistent with the primary. Hence, reads cannot be directed to replicas if transaction-consistent reads are required. If reads are directed to replicas, nothing can be said about the accuracy of the answers. A second disadvantage is that failover is not instantaneous. Hence, the stall during failures is longer than it needs to be. Third, it requires the availability of a log; our experiments show that maintaining a log takes about 20% of total cycles. Hence, we believe it is interesting to consider alternatives to active-passive replication, such as an active-active approach.

The main reason that active-passive replication with log shipping has been used in the past is that the cost of rolling the log forward has been assumed to be far lower than the cost of performing the transaction logic on the replica. In a main memory DBMS, however, a transaction typically costs less than 1 msec, requiring so few cycles that executing it is likely not much slower than playing back a log. Under these conditions, an alternate active-active architecture appears to make sense: all replicas are “active” and each transaction is performed synchronously on all of them. The advantages of this approach are nearly instantaneous failover and no requirement that updates be directed to a primary copy first. Of course, in such a scenario, two-phase commit would introduce substantial additional latency, suggesting that techniques to avoid it are needed, perhaps by performing transactions in timestamp order.
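
The timestamp-order idea can be sketched directly: if every replica applies the same deterministic transactions in the same timestamp order, the copies stay identical without shipping a log. The transaction encoding below is illustrative:

```python
# Active-active replication sketch: each replica applies the same
# deterministic transactions in globally agreed timestamp order, so
# replica states stay identical with no log shipping.
def apply_in_timestamp_order(state: dict, txns: list) -> dict:
    for ts, op, key, amount in sorted(txns):   # order by timestamp
        if op == "credit":
            state[key] = state.get(key, 0) + amount
        elif op == "debit":
            state[key] = state.get(key, 0) - amount
    return state

# Two replicas receive the same transactions in different arrival
# orders but apply them in timestamp order, ending in the same state.
txns_a = [(2, "debit", "acct", 30), (1, "credit", "acct", 100)]
txns_b = [(1, "credit", "acct", 100), (2, "debit", "acct", 30)]
replica_a = apply_in_timestamp_order({}, txns_a)
replica_b = apply_in_timestamp_order({}, txns_b)
```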

5.4  Weak Consistency

Most large web-oriented OLTP shops insist on replicas, usually over a WAN, to achieve high availability and disaster recovery. However, seemingly nobody is willing to pay for transactional consistency over a WAN. As noted in Section 2, the common refrain in web applications is “eventual consistency” [Bre00, DHJ+07]. Typically, proponents of this approach advocate resolving inconsistencies through non-technical means; for example, it is cheaper to give a credit to a complaining customer than to ensure 100% consistency. In other words, the replicas eventually become consistent, presumably once the system quiesces.

It should be clear that eventual consistency is impossible without transaction consistency under a general workload. For example, suppose transaction 1 commits at site 1 and aborts or is lost at site 2. Transaction 2 reads the result of transaction 1 and writes into the database, causing the inconsistency to propagate and pollute the system. That said, clearly, there must be workloads where eventual consistency is achievable, and it would be an interesting exercise to look for them, since, as noted above, our results suggest that removing transactional support—locking and logging—from a main memory system could yield a very high performance database.

5.5  Cache-conscious B-trees

In our study we did not convert Shore B-trees to a “cache-conscious” format [RR99, RR00]. Such an alteration, at least on a system without all of the other optimizations we present, would have only a modest impact. Cache-conscious research on B-trees targets cache misses that result from accessing key values stored in the B-tree nodes. Our optimizations removed between 80% to 88% of the time spent in B-tree operations, without changing the key access pattern. Switching from a stripped-down Shore to our minimal-overhead kernel—which still accesses the same data—removed three quarters of the remaining time. In other words, it appears to be more important to optimize other components, such as concurrency control and recovery, than to optimize data structures. However, once we strip a system down to a very basic kernel, cache misses in the B-tree code may well be the new bottleneck. In fact, it may be the case that other indexing structures, such as hash tables, perform better in this new environment. Again, these conjectures should be carefully tested.

6    Related Work

There have been several studies of performance bottlenecks in modern database systems. [BMK99] and [ADH+99] show the increasing contribution of main memory data stalls to database performance. [MSA+04] breaks down bottlenecks due to contention for various resources (such as locks, I/O synchronization, or CPU) from the client’s point of view (which includes perceived latency due to I/O stalls and preemptive scheduling of other concurrent queries). Unlike the work presented here, these papers analyze complete databases and do not analyze performance per database component. Benchmarking studies such as TPC-B [Ano85] in the OLTP space and the Wisconsin Benchmark [BDT83] in general SQL processing, also characterize the performance of complete databases and not that of individual OLTP components.

Additionally, there has been a large amount of work on main memory databases. Work on main memory indexing structures has included AVL trees [AHU74] and T-trees [LC86]. Other techniques for main memory applicability appear in [BHT87]. Complete systems include TimesTen [Tim07], DataBlitz [BBK+98], and MARS [Eic87]. A survey of this area appears in [GS92]. However, none of this work attempts to isolate the components of overhead, which is the major contribution of this paper.

7    Conclusions

We performed a performance study of Shore motivated by our desire to understand where time is spent in modern database systems, and to help understand what the potential performance of several recently proposed alternative database architectures might be. By stripping out components of Shore, we were able to produce a system that could run our modified TPC-C benchmark about 20 times faster than the original system (albeit with substantially reduced functionality!). We found that buffer management and locking operations are the most significant contributors to system overhead, but that logging and latching operations are also significant. Based on these results, we make several interesting observations. First, unless one strips out all of these components, the performance of a main memory-optimized database (or a database without transactions, or one without logging) is unlikely to be much better than a conventional database where most of the data fit into RAM. Second, when one does produce a fully stripped down system—e.g., that is single threaded, implements recovery via copying state from other nodes in the network, fits in memory, and uses reduced functionality transactions—the performance is orders of magnitude better than an unmodified system. This suggests that recent proposals for stripped down systems [WSA97, SMA+07] may be on to something.

8  Acknowledgments

We thank the SIGMOD reviewers for their helpful comments. This work was partially supported by the National Science Foundation under Grants 0704424 and 0325525.

9    Repeatability Assessment

All the results in this paper were verified by the SIGMOD repeatability committee. Code and/or data used in the paper are available at http://www.sigmod.org/codearchive/sigmod2008/.

References

[ACL87] Agrawal, R., Carey, M. J., and Livny, M. “Concurrency control performance modeling: alternatives and implications.” ACM Trans. Database Syst. 12(4), Dec. 1987.

[AMS+07] Aguilera, M., Merchant, A., Shah, M., Veitch, A. C., and Karamanolis, C. T. “Sinfonia: a new paradigm for building scalable distributed systems.” In Proc. SOSP, 2007.

[AHU74] Aho, A. V., Hopcroft, J. E., and Ullman, J. D. “The Design and Analysis of Computer Algorithms.” Addison-Wesley Publishing Company, 1974.

[ADH+99] Ailamaki, A., DeWitt, D. J., Hill, M. D., and Wood, D. A. “DBMSs on a Modern Processor: Where Does Time Go?” In Proc. VLDB, 1999, 266-277.

[Ail04] Ailamaki, A. “Database Architecture for New Hardware.” Tutorial. In Proc. VLDB, 2004.

[Ano85] Anon et al. “A Measure of Transaction Processing Power.” In Datamation, February 1985.

[BBK+98] Baulier, J. D., Bohannon, P., Khivesara, A., et al. “The DataBlitz Main-Memory Storage Manager: Architecture, Performance, and Experience.” In The VLDB Journal, 1998.

[BDT83] Bitton, D., DeWitt, D. J., and Turbyfill, C. “Benchmarking Database Systems, a Systematic Approach.” In Proc. VLDB, 1983.

[BHT87] Bitton, D., Hanrahan, M., and Turbyfill, C. “Performance of Complex Queries in Main Memory Database Systems.” In Proc. ICDE, 1987.

[BMK99] Boncz, P. A., Manegold, S., and Kersten, M. L. “Database Architecture Optimized for the New Bottleneck: Memory Access.” In Proc. VLDB, 1999.

[Bre00] Brewer, E. A. “Towards robust distributed systems (abstract).” In Proc. PODC, 2000.

[BDR97] Bugnion, E., Devine, S., and Rosenblum, M. “Disco: running commodity operating systems on scalable multiprocessors.” In Proc. SOSP, 1997.

[CDF+94] Carey, M. J., DeWitt, D. J., Franklin, M. J. et al. “Shoring up persistent applications.” In Proc. SIGMOD, 1994.

[CDG+06] Chang, F., Dean, J., Ghemawat, S., Hsieh, W. C., Wallach, D. A., Burrows, M., Chandra, T., Fikes, A., and Gruber, R. E. “Bigtable: A Distributed Storage System for Structured Data.” In Proc. OSDI, 2006.

[DG04] Dean, J. and Ghemawat, S. “MapReduce: Simplified Data Processing on Large Clusters.” In Proc. OSDI, 2004.

[DHJ+07] DeCandia, G., Hastorun, D., Jampani, M., Kakulapati, G., Lakshman, A., Pilchin, A., Sivasubramanian, S., Vosshall, P., and Vogels, W. “Dynamo: amazon’s highly available key-value store.” In Proc. SOSP, 2007.

[DGS+90] DeWitt, D. J., Ghandeharizadeh, S., Schneider, D. A., Bricker, A., Hsiao, H., and Rasmussen, R. “The Gamma Database Machine Project.” IEEE Transactions on Knowledge and Data Engineering 2(1):44-62, March 1990.

[Eic87] Eich, M. H. “MARS: The Design of A Main Memory Database Machine.” In Proc. of the 1987 International workshop on Database Machines, October, 1987.

[GS92] Garcia-Molina, H. and Salem, K. “Main Memory Database Systems: An Overview.” IEEE Trans. Knowl. Data Eng. 4(6): 509-516 (1992).

[GR93] Gray, J. and Reuter, A. “Transaction Processing: Concepts and Techniques.” Morgan Kaufmann Publishers, Inc., 1993.

[GBH+00] Gribble, S. D., Brewer, E. A., Hellerstein, J. M., and Culler, D. E. “Scalable, Distributed Data Structures for Internet Service Construction.” In Proc. OSDI, 2000.

[Hel07] Helland, P. “Life beyond Distributed Transactions: an Apostate’s Opinion.” In Proc. CIDR, 2007.

[HM93] Herlihy, M. P. and Moss, J. E. B. “Transactional Memory: architectural support for lock-free data structures.” In Proc. ISCA, 1993.

[KR81] Kung, H. T. and Robinson, J. T. “On optimistic methods for concurrency control.” ACM Trans. Database Syst. 6(2):213–226, June 1981.

[LM06] Lau, E. and Madden, S. “An Integrated Approach to Recovery and High Availability in an Updatable, Distributed Data Warehouse.” In Proc. VLDB, 2006.

[LC86] Lehman, T. J. and Carey, M.J. “A study of index structures for main memory database management systems.” In Proc. VLDB, 1986.

[LGG+91] Liskov, B., Ghemawat, S., Gruber, R., Johnson, P., Shrira, L., and Williams, M. “Replication in the Harp file system.” In Proc. SOSP, pages 226-238, 1991.

[MSA+04] McWherter, D. T., Schroeder, B., Ailamaki, A., and Harchol-Balter, M. “Priority Mechanisms for OLTP and Transactional Web Applications.” In Proc. ICDE, 2004.

[MHL+92] Mohan, C., Haderle, D., Lindsay, B., Pirahesh, H., and Schwarz, P. “ARIES: a transaction recovery method supporting fine-granularity locking and partial rollbacks using write-ahead logging.” ACM Trans. Database Syst. 17(1):94-162, 1992.

[Moh89] Mohan, C. “ARIES/KVL: A Key-Value Locking Method for Concurrency Control of Multiaction Transactions Operating on B-Tree Indexes.” 1989, Research Report RJ 7008, Data Base Technology Institute, IBM Almaden Research Center.

[ML89] Mohan, C. and Levine, F. “ARIES/IM: An Efficient and High Concurrency Index Management Method Using Write-Ahead Logging.” 1989, Research Report RJ 6846, Data Base Technology Institute, IBM Almaden Research Center.

[MBD+99] Mucci, P. J., Browne, S., Deane, C., and Ho, G. “PAPI: A Portable Interface to Hardware Performance Counters.” In Proc. Department of Defense HPCMP Users Group Conference, Monterey, CA, June 1999.

[RR99] Rao, J. and Ross, K. A. “Cache Conscious Indexing for Decision-Support in Main Memory.” In Proc. VLDB, 1999.

[RR00] Rao, J. and Ross, K. A. “Making B+-trees cache conscious in main memory.” In SIGMOD Record, 29(2):475-486, June 2000.

[SMK+01] Stoica, I., Morris, R., Karger, D. R., Kaashoek, M. F., and Balakrishnan, H. “Chord: A Scalable Peer-to-peer Lookup Protocol for Internet Applications.” In Proc. SIGCOMM, 2001.

[SAB+05] Stonebraker, M., Abadi, D. J., Batkin, A., Chen, X., Cherniack, M., Ferreira, M., Lau, E., Lin, A., Madden, S., O’Neil, E., O’Neil, P., Rasin, A., Tran, N., and Zdonik, S. “C-Store: A Column-oriented DBMS.” In Proc. VLDB, 2005.

[SMA+07] Stonebraker, M., Madden, S., Abadi, D. J., Harizopoulos, S., Hachem, N., and Helland, P. “The End of an Architectural Era (It’s Time for a Complete Rewrite).” In Proc. VLDB, 2007.

[Tim07] Oracle TimesTen. http://www.oracle.com/timesten/index.html. 2007.

[TPCC] The Transaction Processing Council. TPC-C Benchmark (Rev. 5.8.0), 2006. http://www.tpc.org/tpcc/spec/tpcc_current.pdf

[WSA97] Whitney, A., Shasha, D., and Apter, S. “High Volume Transaction Processing Without Concurrency Control, Two Phase Commit, SQL or C.” In Proc. HPTPS, 1997.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

SIGMOD’08, June 9–12, 2008, Vancouver, BC, Canada.

Copyright 2008 ACM 978-1-60558-102-6/08/06 … $5.00.

Original DOI: 10.1145/1376616.1376713

1. http://www.cs.wisc.edu/shore/

2. http://db.cs.yale.edu/hstore/

“One Size Fits All”: An Idea Whose Time Has Come and Gone

Michael Stonebraker (MIT CSAIL and StreamBase Systems, Inc.)

Uǧur Çetintemel (Brown University and StreamBase Systems, Inc.)

Abstract

The last 25 years of commercial DBMS development can be summed up in a single phrase: “One size fits all”. This phrase refers to the fact that the traditional DBMS architecture (originally designed and optimized for business data processing) has been used to support many data-centric applications with widely varying characteristics and requirements.

In this paper, we argue that this concept is no longer applicable to the database market, and that the commercial world will fracture into a collection of independent database engines, some of which may be unified by a common frontend parser. We use examples from the stream-processing market and the data-warehouse market to bolster our claims. We also briefly discuss other markets for which the traditional architecture is a poor fit and argue for a critical rethinking of the current factoring of systems services into products.

1  Introduction

Relational DBMSs arrived on the scene as research prototypes in the 1970’s, in the form of System R [10] and INGRES [27]. The main thrust of both prototypes was to surpass IMS in value to customers on the applications that IMS was used for, namely “business data processing”. Hence, both systems were architected for on-line transaction processing (OLTP) applications, and their commercial counterparts (i.e., DB2 and INGRES, respectively) found acceptance in this arena in the 1980’s. Other vendors (e.g., Sybase, Oracle, and Informix) followed the same basic DBMS model, which stores relational tables row-by-row, uses B-trees for indexing, uses a cost-based optimizer, and provides ACID transaction properties.

Since the early 1980’s, the major DBMS vendors have steadfastly stuck to a “one size fits all” strategy, whereby they maintain a single code line with all DBMS services. The reasons for this choice are straightforward—the use of multiple code lines causes various practical problems, including:

•  a cost problem, because maintenance costs increase at least linearly with the number of code lines;

•  a compatibility problem, because all applications have to run against every code line;

•  a sales problem, because salespeople get confused about which product to try to sell to a customer; and

•  a marketing problem, because multiple code lines need to be positioned correctly in the marketplace.

To avoid these problems, all the major DBMS vendors have followed the adage “put all wood behind one arrowhead”. In this paper we argue that this strategy has failed already, and will fail more dramatically off into the future.

The rest of the paper is structured as follows. In Section 2, we briefly indicate why the single code-line strategy has failed already by citing some of the key characteristics of the data warehouse market. In Section 3, we discuss stream processing applications and indicate a particular example where a specialized stream processing engine outperforms an RDBMS by two orders of magnitude. Section 4 then turns to the reasons for the performance difference, and indicates that DBMS technology is not likely to be able to adapt to be competitive in this market. Hence, we expect stream processing engines to thrive in the marketplace. In Section 5, we discuss a collection of other markets where one size is not likely to fit all, and other specialized database systems may be feasible. Hence, the fragmentation of the DBMS market may be fairly extensive. In Section 6, we offer some comments about the factoring of system software into products. Finally, we close the paper with some concluding remarks in Section 7.

2  Data Warehousing

In the early 1990’s, a new trend appeared: Enterprises wanted to gather together data from multiple operational databases into a data warehouse for business intelligence purposes. A typical large enterprise has 50 or so operational systems, each with an on-line user community who expect fast response time. System administrators were (and still are) reluctant to allow business-intelligence users onto the same systems, fearing that the complex ad-hoc queries from these users will degrade response time for the on-line community. In addition, business-intelligence users often want to see historical trends, as well as correlate data from multiple operational databases. These features are very different from those required by on-line users.

For these reasons, essentially every enterprise created a large data warehouse, and periodically “scraped” the data from operational systems into it. Business-intelligence users could then run their complex ad-hoc queries against the data in the warehouse, without affecting the on-line users. Although most warehouse projects were dramatically over budget and ended up delivering only a subset of promised functionality, they still delivered a reasonable return on investment. In fact, it is widely acknowledged that historical warehouses of retail transactions pay for themselves within a year, primarily as a result of more informed stock rotation and buying decisions. For example, a business-intelligence user can discover that pet rocks are out and Barbie dolls are in, and then make appropriate merchandise placement and buying decisions.

Data warehouses are very different from OLTP systems. OLTP systems have been optimized for updates, as the main business activity is typically to sell a good or service. In contrast, the main activity in data warehouses is ad-hoc queries, which are often quite complex. Hence, periodic load of new data interspersed with ad-hoc query activity is what a typical warehouse experiences.

The standard wisdom in data warehouse schemas is to create a fact table, containing the “who, what, when, where” about each operational transaction. For example, Figure 1 shows the schema for a typical retailer. Note the central fact table, which holds an entry for each item that is scanned by a cashier in each store in its chain. In addition, the warehouse contains dimension tables, with information on each store, each customer, each product, and each time period. In effect, the fact table contains a foreign key for each of these dimensions, and a star schema is the natural result. Such star schemas are omnipresent in warehouse environments, but are virtually nonexistent in OLTP environments.

It is a well known homily that warehouse applications run much better using bit-map indexes while OLTP users prefer B-tree indexes. The reasons are straight-forward: bit-map indexes are faster and more compact on warehouse workloads, while failing to work well in OLTP environments. As a result, many vendors support both B-tree indexes and bit-map indexes in their DBMS products.
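The intuition behind bitmap indexes can be shown in a few lines. The sketch below keeps one bitmap per distinct value of a low-cardinality column, so a conjunctive predicate becomes a single bitwise AND; it is an illustration of the idea only, since real DBMS bitmap indexes add compression such as run-length encoding:

```python
# Minimal bitmap-index sketch: one bitmap per distinct column value,
# stored here as a Python integer whose bit i is row i.
def build_bitmaps(column):
    bitmaps = {}
    for pos, value in enumerate(column):
        bitmaps.setdefault(value, 0)
        bitmaps[value] |= 1 << pos
    return bitmaps

state  = ["MA", "CA", "MA", "NY", "CA", "MA"]
gender = ["F",  "M",  "M",  "F",  "F",  "M"]

state_idx, gender_idx = build_bitmaps(state), build_bitmaps(gender)

# WHERE state = 'MA' AND gender = 'M': one AND over two bitmaps,
# with no per-row pointer chasing as in a B-tree lookup.
hits = state_idx["MA"] & gender_idx["M"]
matching_rows = [i for i in range(len(state)) if (hits >> i) & 1]
print(matching_rows)  # [2, 5]
```

Each bitmap costs one bit per row per distinct value, which is why the technique pays off only for low-cardinality columns of the kind warehouses are full of, and not for the high-cardinality keys that dominate OLTP.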

Figure 1  A typical star schema

In addition, materialized views are a useful optimization tactic in warehouse worlds, but never in OLTP worlds. In contrast, normal (“virtual”) views find acceptance in OLTP environments.

To a first approximation, most vendors have a warehouse DBMS (bit-map indexes, materialized views, star schemas and optimizer tactics for star schema queries) and an OLTP DBMS (B-tree indexes and a standard cost-based optimizer), which are united by a common parser, as illustrated in Figure 2.

Although this configuration allows such a vendor to market his DBMS product as a single system, because of the single user interface, in effect he is selling multiple systems. Moreover, there will be considerable pressure from both the OLTP and warehouse markets for features that are of no use in the other world. For example, it is common practice in OLTP databases to represent the state (in the United States) portion of an address as a two-byte character string. In contrast, it is obvious that 50 states can be coded into six bits. If there are enough queries and enough data to justify the cost of coding the state field, then the latter representation is advantageous. This is usually true in warehouses and never true in OLTP. Hence, elaborate coding of fields will be a warehouse feature that has little or no utility in OLTP. The inclusion of additional market-specific features will make commercial products look increasingly like the architecture illustrated in Figure 2.
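The space arithmetic behind the six-bit coding argument is easy to make concrete. The sketch below dictionary-encodes state strings to small integers (the state list is abbreviated; only its size matters here):

```python
# Dictionary-encode two-character state strings into 6-bit integers:
# 2**6 = 64 >= 50, so every U.S. state code fits in six bits.
states = ["AL", "AK", "AZ", "AR", "CA", "CO", "CT"]  # abbreviated list
code_of = {s: i for i, s in enumerate(sorted(states))}

assert len(states) <= 2 ** 6  # all codes fit in 6 bits

# A warehouse column of one million state values:
# two characters (16 bits) each, versus 6 bits dictionary-encoded.
n = 1_000_000
char_bits  = n * 16
coded_bits = n * 6
print(coded_bits / char_bits)  # 0.375, i.e. the coded column is ~2.7x smaller
```

The saving only pays for itself when enough queries scan enough of the column to amortize the encode/decode cost, which is exactly the warehouse access pattern and not the OLTP one.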

Figure 2  The architecture of current DBMSs

The illusion of “one size fits all” can be preserved as a marketing fiction for the two different systems of Figure 2, because of the common user interface. In the stream processing market, to which we now turn, such a common front end is impractical. Hence, not only will there be different engines but also different front ends. The marketing fiction of “one size fits all” will not fly in this world.

3  Stream Processing

Recently, there has been considerable interest in the research community in stream processing applications [7, 13, 14, 20]. This interest is motivated by the upcoming commercial viability of sensor networks over the next few years. Although RFID has gotten all the press recently and will find widespread acceptance in retail applications dealing with supply chain optimization, there are many other technologies as well (e.g., Lojack [3]). Many industry pundits see a “green field” of monitoring applications that will be enabled by this “sea change” caused by networks of low-cost sensor devices.

3.1  Emerging Sensor-based Applications

There are obvious applications of sensor network technology in the military domain. For example, the US Army is investigating putting vital-signs monitors on all soldiers, so that they can optimize medical triage in combat situations. In addition, there is already a GPS system in many military vehicles, but it is not connected yet into a closed-loop system. Instead, the army would like to monitor the position of all vehicles and determine, in real time, if they are off course. Additionally, they would like a sensor on the gun turret; together with location, this will allow the detection of crossfire situations. A sensor on the gas gauge will allow the optimization of refueling. In all, an army battalion of 30,000 humans and 12,000 vehicles will soon be a large-scale sensor network of several hundred thousand nodes delivering state and position information in real time.

Processing nodes in the network and downstream servers must be capable of dealing with this “firehose” of data. Required operations include sophisticated alerting, such as notifying the platoon commander when three of his four vehicles cross the front line. Also required are historical queries, such as “Where has vehicle 12 been for the last two hours?” Lastly, requirements encompass longitudinal queries, such as “What is the overall state of readiness of the force right now?”

Other sensor-based monitoring applications will also come over time in many non-military applications. Monitoring traffic congestion and suggesting alternate travel routes is one example. A related application is variable, congestion-based tolling on highway systems, which was the inspiration behind the Linear Road benchmark [9]. Amusement parks will soon turn passive wristbands on customers into active sensors, so that rides can be optimized and lost children located. Cell phones are already active devices, and one can easily imagine a service whereby the closest restaurant to a hungry customer can be located. Even library books will be sensor tagged, because if one is mis-shelved, it may be lost forever in a big library.

There is widespread speculation that conventional DBMSs will not perform well on this new class of monitoring applications. In fact, on Linear Road, traditional solutions are nearly an order of magnitude slower than a special purpose stream processing engine [9]. The inapplicability of the traditional DBMS technology to streaming applications is also bolstered by an examination of the current application areas with streaming data. We now discuss our experience with such an application, financial-feed processing.

3.2  An Existing Application: Financial-Feed Processing

Most large financial institutions subscribe to feeds that deliver real-time data on market activity, specifically news, consummated trades, bids and asks, etc. Reuters, Bloomberg and Infodyne are examples of vendors that deliver such feeds. Financial institutions have a variety of applications that process such feeds. These include systems that produce real-time business analytics, ones that perform electronic trading, ones that ensure legal compliance of all trades to the various company and SEC rules, and ones that compute real-time risk and market exposure to fluctuations in foreign exchange rates. The technology used to implement this class of applications is invariably “roll your own”, because application experts have not had good luck with off-the-shelf system software products.

In order to explore feed processing issues more deeply, we now describe in detail a specific prototype application, which was specified by a large mutual fund company. This company subscribes to several commercial feeds, and has a current production application that watches all feeds for the presence of late data. The idea is to alert the traders if one of the commercial feeds is delayed, so that the traders can know not to trust the information provided by that feed. This company is unhappy with the performance and flexibility of their “roll your own” solution and requested a pilot using a stream processing engine.

The company engineers specified a simplified version of their current application to explore the performance differences between their current system and a stream processing engine. According to their specification, they were looking for maximum message processing throughput on a single PC-class machine for a subset of their application, which consisted of two feeds reporting data from two exchanges.

Specifically, there are 4500 securities, 500 of which are “fast moving”. A stock tick on one of these securities is late if it occurs more than five seconds after the previous tick from the same security. The other 4000 symbols are slow moving, and a tick is late if 60 seconds have elapsed since the previous tick.

There are two feed providers and the company wished to receive an alert message each time there is a late tick from either provider. In addition, they wished to maintain a counter for each provider. When 100 late ticks have been received from either provider, they wished to receive a special “this is really bad” message and then to suppress the subsequent individual tick reports.

The last wrinkle in the company’s specification was that they wished to accumulate late ticks from each of two exchanges, say NYSE and NASD, regardless of which feed vendor produced the late data. If 100 late messages were received from either exchange through either feed vendor, they wished to receive two additional special messages. In summary, they want four counters, each counting to 100, with a resulting special message. An abstract representation of the query diagram for this task is shown in Figure 3.
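The specification above can be sketched as a small in-memory event handler: per-security lateness thresholds, plus counters keyed by provider and by exchange that each count to 100, emit one special message, and then suppress further individual reports. This is an illustration of the specified logic only, not StreamBase code:

```python
# Sketch of the feed-alarm logic: fast movers are late after a 5-second
# gap, slow movers after 60 seconds; one counter per provider and per
# exchange, each emitting a single special alert at 100 late ticks.
FAST_THRESHOLD, SLOW_THRESHOLD = 5.0, 60.0

class FeedAlarm:
    def __init__(self, fast_symbols):
        self.fast = set(fast_symbols)   # the 500 "fast moving" securities
        self.last_tick = {}             # symbol -> timestamp of last tick
        self.late_count = {}            # provider or exchange -> late ticks
        self.alerts = []

    def on_tick(self, symbol, provider, exchange, ts):
        prev = self.last_tick.get(symbol)
        self.last_tick[symbol] = ts
        threshold = FAST_THRESHOLD if symbol in self.fast else SLOW_THRESHOLD
        if prev is None or ts - prev <= threshold:
            return                       # on time: nothing to report
        for key in (provider, exchange): # four counters in total
            n = self.late_count.get(key, 0) + 1
            self.late_count[key] = n
            if n < 100:
                self.alerts.append(("late", key, symbol))
            elif n == 100:
                self.alerts.append(("this is really bad", key))
            # n > 100: individual tick reports are suppressed

alarm = FeedAlarm(fast_symbols={"IBM"})
alarm.on_tick("IBM", "reuters", "NYSE", ts=0.0)
alarm.on_tick("IBM", "reuters", "NYSE", ts=10.0)  # 10s gap > 5s: late
print(alarm.alerts)
# [('late', 'reuters', 'IBM'), ('late', 'NYSE', 'IBM')]
```

Note that every tick touches only a dictionary entry; nothing is written to stable storage. That property is the crux of the performance comparison in the next section.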

Although this prototype application is only a subset of the application logic used in the real production system, it represents a simple-to-specify task on which performance can be readily measured; as such, it is a representative example. We now turn to the speed of this example application on a stream processing engine as well as an RDBMS.

Figure 3  The Feed Alarm application in StreamBase

4  Performance Discussion

The example application discussed in the previous section was implemented in the StreamBase stream processing engine (SPE) [5], which is basically a commercial, industrial-strength version of Aurora [8, 13]. On a 2.8 GHz Pentium processor with 512 Mbytes of memory and a single SCSI disk, the workflow in Figure 3 can be executed at 160,000 messages per second, before CPU saturation is observed. In contrast, StreamBase engineers could only coax 900 messages per second from an implementation of the same application using a popular commercial relational DBMS.

In this section, we discuss the main reasons that result in the two orders of magnitude difference in observed performance. As we argue below, the reasons have to do with the inbound processing model, correct primitives for stream processing, and seamless integration of DBMS processing with application processing. In addition, we also consider transactional behavior, which is often another major consideration.

Figure 4  “Outbound” processing

4.1  “Inbound” versus “Outbound” Processing

Built fundamentally into the DBMS model of the world is what we term “outbound” processing, illustrated in Figure 4. Specifically, one inserts data into a database as a first step (step 1). After indexing the data and committing the transaction, that data is available for subsequent query processing (step 2) after which results are presented to the user (step 3). This model of “process-after-store” is at the heart of all conventional DBMSs, which is hardly surprising because, after all, the main function of a DBMS is to accept and then never lose data.

In real-time applications, the storage operation, which must occur before processing, adds significantly both to the delay (i.e., latency) in the application, as well as to the processing cost per message of the application. An alternative processing model that avoids this storage bottleneck is shown graphically in Figure 5. Here, input streams are pushed to the system (step 1) and get processed (step 2) as they “fly by” in memory by the query network. The results are then pushed to the client application(s) for consumption (step 3). Reads or writes to storage are optional and can be executed asynchronously in many cases, when they are present. The fact that storage is absent or optional saves both on cost and latency, resulting in significantly higher performance. This model, called “inbound” processing, is what is employed by a stream processing engine such as StreamBase.
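The push model can be sketched in a few lines of ordinary code. This is an illustration only, with hypothetical names rather than the StreamBase API: operators form a query network, and each arriving message is pushed through the chain in memory, reaching the client sink without ever being stored first.

```python
# Illustrative sketch of "inbound" processing (hypothetical names, not the
# StreamBase API): operators are chained callables, and each arriving
# message is pushed through them in memory before results reach the sink.
def make_inbound_network(operators, sink):
    def push(message):
        for op in operators:
            message = op(message)
            if message is None:   # an operator may drop the message
                return
        sink(message)             # step 3: results pushed to the client
    return push

results = []
network = make_inbound_network(
    operators=[
        lambda m: m if m["price"] > 0 else None,               # filter bad ticks
        lambda m: {**m, "price_cents": int(m["price"] * 100)}, # transform
    ],
    sink=results.append,
)

# Step 1: streams are pushed in; nothing is indexed or committed first.
for msg in [{"sym": "IBM", "price": 81.5}, {"sym": "IBM", "price": -1.0}]:
    network(msg)
```

In this sketch, a read or write to storage would be just another optional operator in the chain, executed asynchronously if at all.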

One is, of course, led to ask "Can a DBMS do inbound processing?" DBMSs were originally designed as outbound processing engines, but grafted triggers onto their engines as an afterthought many years later. There are many restrictions on triggers (e.g., the number allowed per table) and no way to ensure trigger safety (i.e., ensuring that triggers do not go into an infinite loop). Overall, there is very little or no programming support for triggers. For example, there is no way to see what triggers are in place in an application, and no way to add a trigger to a table through a graphical user interface. Moreover, virtual views and materialized views are provided for regular tables, but not for triggers. Lastly, triggers often have performance problems in existing engines. When StreamBase engineers tried to use them for the feed alarm application, they still could not obtain more than 900 messages per second. In summary, triggers were incorporated into existing designs as an afterthought and are thus second-class citizens in current systems.

Figure 5  “Inbound” processing

As such, relational DBMSs are outbound engines onto which limited inbound processing has been grafted. In contrast, stream processing engines, such as Aurora and StreamBase, are fundamentally inbound processing engines. From the ground up, an inbound engine looks radically different from an outbound engine. For example, an outbound engine uses a "pull" model of processing, i.e., a query is submitted and it is the job of the engine to efficiently pull records out of storage to satisfy the query. In contrast, an inbound engine uses a "push" model of processing, and it is the job of the engine to efficiently push incoming messages through the processing steps entailed in the application.

Another way to view the distinction is that an outbound engine stores the data and then executes the queries against the data. In contrast, an inbound engine stores the queries and then passes the incoming data (messages) through the queries.

Although it seems conceivable to construct an engine that is either an inbound or an outbound engine, such a design is clearly a research project. In the meantime, DBMSs are optimized for outbound processing, and stream processing engines for inbound processing. In the feed alarm application, this difference in philosophy accounts for a substantial portion of the performance difference observed.

4.2  The Correct Primitives

SQL systems contain a sophisticated aggregation system, whereby a user can run a statistical computation over groupings of the records from a table in a database. The standard example is:

Select avg (salary)
From employee
Group by department

When the execution engine processes the last record in the table, it can emit the aggregate calculation for each group of records. However, this construct is of little benefit in streaming applications, where streams continue forever and there is no notion of “end of table”.

Consequently, stream processing engines extend SQL (or some other aggregation language) with the notion of time windows. In StreamBase, windows can be defined based on clock time, number of messages, or breakpoints in some other attribute. In the feed alarm application, the leftmost box in each stream is such an aggregate box. The aggregate groups stocks by symbol and then defines windows to be ticks 1 and 2, 2 and 3, 3 and 4, etc. for each stock. Such “sliding windows” are often very useful in real-time applications.

In addition, StreamBase aggregates have been constructed to deal intelligently with messages which are late, out-of-order, or missing. In the feed alarm application, the customer is fundamentally interested in looking for late data. StreamBase allows aggregates on windows to have two additional parameters. The first is a timeout parameter, which instructs the StreamBase execution engine to close a window and emit a value even if the condition for closing the window has not been satisfied. This parameter effectively deals with late or missing tuples. The second parameter is slack, which is a directive to the execution engine to keep a window open, after its closing condition has been satisfied. This parameter addresses disorder in tuple arrivals. These two parameters allow the user to specify how to deal with stream abnormalities and can be effectively utilized to improve system resilience.

In the feed alarm application each window is two ticks, but has a timeout of either 5 or 60 seconds. This will cause windows to be closed if the inter-arrival time between successive ticks exceeds the maximum defined by the user. This is a very efficient way to discover late data; i.e., as a side effect of the highly-tuned aggregate logic. In the example application, the box after each aggregate discards the valid data and keeps only the timeout messages. The remainder of the application performs the necessary bookkeeping on these timeouts.
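The way a timeout surfaces late data can be sketched with simplified logic of our own (not StreamBase's implementation, which would fire the timeout from a timer even if no further tick ever arrives): ticks are grouped by symbol into two-tick windows, and a window whose inter-arrival gap exceeds the timeout yields a timeout message rather than a normal aggregate.

```python
# Simplified sketch of two-tick windows with a timeout (illustrative only;
# a real engine fires the timeout from a timer, not on the next arrival).
# Input: (symbol, arrival_time) pairs. Output: timeout messages for windows
# whose inter-arrival gap exceeds `timeout` seconds.
def late_tick_alarms(ticks, timeout=5.0):
    last_seen = {}   # symbol -> arrival time of the previous tick
    alarms = []
    for symbol, t in ticks:
        if symbol in last_seen and t - last_seen[symbol] > timeout:
            alarms.append((symbol, t - last_seen[symbol]))  # timeout message
        last_seen[symbol] = t   # the new tick opens the next sliding window
    return alarms

# IBM ticks at t=0, 2, and 9: the 2 -> 9 gap (7 s) exceeds the 5 s timeout.
alarms = late_tick_alarms([("IBM", 0.0), ("IBM", 2.0), ("IBM", 9.0)])
```

As in the feed alarm application, only the timeout messages are kept; windows that close normally are discarded downstream.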

Having the right primitives at the lower layers of the system enables very high performance. In contrast, a relational engine contains no such built-in constructs. Simulating their effect with conventional SQL is quite tedious, and results in a second significant difference in performance.

It is possible to add time windows to SQL, but these make no sense on stored data. Hence, window constructs would have to be integrated into some sort of an inbound processing model.

4.3  Seamless Integration of DBMS Processing and Application Logic

Relational DBMSs were all designed to have client-server architectures. In this model, there are many client applications, which can be written by arbitrary people, and which are therefore typically untrusted. Hence, for security and reliability reasons, these client applications are run in a separate address space from the DBMS. The cost of this choice is that the application runs in one address space while DBMS processing occurs in another, and a process switch is required to move from one address space to the other.

In contrast, the feed alarm application is an example of an embedded system. It is written by one person or group, who is trusted to “do the right thing”. The entire application consists of (1) DBMS processing—for example the aggregation and filter boxes, (2) control logic to direct messages to the correct next processing step, and (3) application logic. In StreamBase, these three kinds of functionality can be freely interspersed. Application logic is supported with user-defined boxes, the Count100 box in our example financial-feed processing application. The actual code, shown in Figure 6, consists of four lines of C++ that counts to 100 and sets a flag that ensures that the correct messages are emitted. Control logic is supported by allowing multiple predicates in a filter box, and thereby multiple exit arcs. As such, a filter box performs “if-then-else” logic in addition to filtering streams.
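The actual Count100 box is four lines of C++ (Figure 6); a hypothetical Python analogue of such a user-defined box (the names here are ours, not StreamBase's) conveys how little logic is involved: count messages and emit only every hundredth one.

```python
# Hypothetical analogue of a user-defined box like Count100 (names ours):
# count incoming messages and emit only when 100 have been seen.
class Count100:
    def __init__(self):
        self.count = 0

    def process(self, message):
        self.count += 1
        if self.count == 100:   # the flag condition: 100th message seen
            self.count = 0      # reset for the next batch
            return message      # emit downstream
        return None             # otherwise suppress

box = Count100()
emitted = [m for m in range(250) if box.process(m) is not None]
```

In StreamBase, such a user-defined box is interspersed freely with aggregate and filter boxes in the same address space, which is precisely the point of the integration.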

In effect, the feed alarm application is a mix of DBMS-style processing, conditional expressions, and user-defined functions in a conventional programming language. This combination is performed by StreamBase within a single address space without any process switches. Such a seamless integration of DBMS logic with conventional programming facilities was proposed many years ago in Rigel [23] and Pascal-R [25], but was never implemented in commercial relational systems. Instead, the major vendors implemented stored procedures, which are much more limited programming systems. More recently, the emergence of object-relational engines provided blades or extenders, which are more powerful than stored procedures, but still do not facilitate flexible control logic.

Figure 6  “Count100” Logic

Embedded systems do not need the protection provided by client-server DBMSs, and a two-tier architecture merely generates overhead. This is a third source of the performance difference observed in our example application.

Another integration issue, not exemplified by the feed alarm example, is the storage of state information in streaming applications. Most stream processing applications require saving some state, anywhere from modest numbers of megabytes to small numbers of gigabytes. Such state information may include (1) reference data (i.e., what stocks are of interest), (2) translation tables (in case feeds use different symbols for the same stock), and (3) historical data (e.g., “how many late ticks were observed every day during the last year?”). As such, tabular storage of data is a requirement for most stream processing applications.

StreamBase embeds BerkeleyDB [4] for state storage. However, there is approximately one order of magnitude performance difference between calling BerkeleyDB in the StreamBase address space and calling it in client-server mode in a different address space. This is yet another reason to avoid process switches by mixing DBMS and application processing in one address space.

Although one might suggest that DBMSs enhance their programming models to address this performance problem, there are very good reasons why client-server DBMSs were designed the way they are. Most business data processing applications need the protection that is afforded by this model. Stored procedures and object-relational blades were an attempt to move some of the client logic into the server to gain performance. To move further, a DBMS would have to implement both an embedded and a non-embedded model, with different run time systems. Again, this would amount to giving up on “one size fits all”.

In contrast, feed processing systems are invariably embedded applications. Hence, the application and the DBMS are written by the same people, and driven from external feeds, not from human-entered transactions. As such, there is no reason to protect the DBMS from the application, and it is perfectly acceptable to run both in the same address space. In an embedded processing model, it is reasonable to freely mix application logic, control logic and DBMS logic, which is exactly what StreamBase does.

4.4  High Availability

It is a requirement of many stream-based applications to have high availability (HA) and stay up 7x24. Standard DBMS logging and crash recovery mechanisms (e.g., [22]) are ill-suited for the streaming world as they introduce several key problems.

First, log-based recovery may take anywhere from several seconds to a few minutes. During this period, the application would be "down". Such behavior is clearly undesirable in many real-time streaming domains (e.g., financial services). Second, in case of a crash, some effort must be made to buffer the incoming data streams, as otherwise this data will be irretrievably lost during the recovery process. Third, DBMS recovery will only deal with tabular state and will thus ignore operator states. For example, in the feed alarm application, the counters are not stored in tables; therefore, their state would be lost in a crash. One straightforward fix would be to force all operator state into tables to use DBMS-style recovery; however, this solution would significantly slow down the application.

The obvious alternative to achieve high availability is to use techniques that rely on Tandem-style process pairs [11]. The basic idea is that, in the case of a crash, the application performs failover to a backup machine, which typically operates as a “hot standby”, and keeps going with small delay. This approach eliminates the overhead of logging. As a case in point, StreamBase turns off logging in BerkeleyDB.

Unlike traditional data-processing applications that require precise recovery for correctness, many stream-processing applications can tolerate and benefit from weaker notions of recovery. In other words, failover does not always need to be "perfect". Consider monitoring applications that operate on data streams whose values are periodically refreshed. Such applications can often tolerate tuple losses when a failure occurs, as long as such interruptions are short. Similarly, if one loses a couple of ticks in the feed alarm application during failover, the correctness would probably still be preserved. In contrast, applications that trigger alerts when certain combinations of events happen require that no tuples be lost, but may tolerate temporary duplication. For example, a patient monitoring application may be able to tolerate duplicate tuples ("heart rate is 79") but not lost tuples ("heart rate has changed to zero"). Of course, there will always be a class of applications that require strong, precise recovery guarantees. A financial application that performs portfolio management based on individual stock transactions falls into this category.

As a result, there is an opportunity to devise simplified and low overhead failover schemes, when weaker correctness notions are sufficient. A collection of detailed options on how to achieve high availability in a streaming world has recently been explored [17].

4.5  Synchronization

Many stream-based applications rely on shared data and computation. Shared data is typically contained in a table that one query updates and another one reads. For example, the Linear Road application requires that vehicle-position data be used to update statistics on highway usage, which in turn are read to determine tolls for each segment on the highway. Thus, there is a basic need to provide isolation between messages.

Traditional DBMSs use ACID transactions to provide isolation (among other things) between concurrent transactions submitted by multiple users. In streaming systems, which are not multi-user, such isolation can be effectively achieved through simple critical sections, which can be implemented with lightweight semaphores. Since full-fledged transactions are not required, there is no need to use heavyweight locking-based mechanisms.
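A minimal sketch of this idea, assuming shared state inside a single multithreaded engine process (the Linear Road statistics serve as our stand-in example, and the toll rule is invented): a plain lock around the shared table gives per-message isolation without any transactional machinery.

```python
import threading

# Shared state (e.g., per-segment highway-usage statistics), protected by a
# lightweight critical section rather than ACID transactions.
stats_lock = threading.Lock()
segment_counts = {}

def update_stats(segment):              # the updating "query"
    with stats_lock:
        segment_counts[segment] = segment_counts.get(segment, 0) + 1

def read_toll(segment):                 # the reading "query"
    with stats_lock:                    # same critical section => isolation
        return segment_counts.get(segment, 0) * 0.10  # hypothetical toll rule

# Fifty concurrent position updates for one highway segment.
threads = [threading.Thread(target=update_stats, args=("seg1",)) for _ in range(50)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

The lock is held only for the few instructions that touch the shared table, so there is no logging, no lock manager, and no deadlock detection of the kind a full transaction system would entail.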

In summary, ACID properties are not required in most stream processing applications, and simpler, specialized performance constructs can be used to advantage.

5  One Size Fits All?

The previous section has indicated a collection of architectural issues that result in significant differences in performance between specialized stream processing engines and traditional DBMSs. These design choices result in a big difference between the internals of the two engines. In fact, the run-time code in StreamBase looks nothing like a traditional DBMS run-time. The net result is vastly better performance on a class of real-time applications. These considerations will lead to a separate code line for stream processing, assuming, of course, that the market is large enough to support one.

In the rest of the section, we outline several other markets for which specialized database engines may be viable.

5.1  Data Warehouses

The architectural differences between OLTP and warehouse database systems discussed in Section 2 are just the tip of the iceberg, and additional differences will occur over time. We now focus on probably the biggest architectural difference, which is to store the data by column, rather than by row.

All major DBMS vendors implement record-oriented storage systems, where the attributes of a record are placed contiguously in storage. Using this “row-store” architecture, a single disk write is all that is required to push all of the attributes of a single record out to disk. Hence, such a system is “write-optimized” because high performance on record writes is easily achievable. It is easy to see that write-optimized systems are especially effective on OLTP-style applications, the primary reason why most commercial DBMSs employ this architecture.

In contrast, warehouse systems need to be "read-optimized" as most of the workload consists of ad hoc queries that touch large amounts of historical data. In such systems, a "column-store" model, where the values for all of the rows of a single attribute are stored contiguously, is drastically more efficient (as demonstrated by Sybase IQ [6], Addamark [1], and KDB [2]).

With a column-store architecture, a DBMS need only read the attributes required for processing a given query, and can avoid bringing into memory any other irrelevant attributes. Given that records with hundreds of attributes (with many null values) are becoming increasingly common, this approach results in a sizeable performance advantage for warehouse workloads where typical queries involve aggregates that are computed on a small number of attributes over large data sets. The first author of this paper is engaged in a research project to explore the performance benefits of a column-store system.
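A toy illustration of the two layouts (with hypothetical attribute names): averaging one attribute from the row layout touches every field of every record, while the column layout reads exactly one contiguous array.

```python
# Toy row-store vs. column-store layouts for the same four-attribute table.
rows = [   # row store: the attributes of one record are contiguous
    {"id": 1, "price": 10.0, "qty": 5, "note": "..."},
    {"id": 2, "price": 20.0, "qty": 3, "note": "..."},
]
columns = {   # column store: all values of one attribute are contiguous
    "id":    [1, 2],
    "price": [10.0, 20.0],
    "qty":   [5, 3],
    "note":  ["...", "..."],
}

# Row store: every record (all four attributes) is brought in to read `price`.
row_avg = sum(r["price"] for r in rows) / len(rows)

# Column store: only the `price` array is touched; id, qty, note stay on disk.
col_avg = sum(columns["price"]) / len(columns["price"])
```

With hundreds of attributes per record, the fraction of I/O the column layout avoids grows accordingly, which is the source of the warehouse-workload advantage.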

5.2  Sensor Networks

It is not practical to run a traditional DBMS in the processing nodes that manage sensors in a sensor network [21, 24]. These emerging platforms of device networks are currently being explored for applications such as environmental and medical monitoring, industrial automation, autonomous robotic teams, and smart homes [16, 19, 26, 28, 29].

In order to realize the full potential of these systems, the components are designed to be wireless, with respect to both communication and energy. In this environment, bandwidth and power become the key resources to be conserved. Furthermore, communication, as opposed to processing or storage access, is the main consumer of energy. Thus, standard DBMS optimization tactics do not apply and need to be critically rethought. Furthermore, transactional capabilities seem to be irrelevant in this domain.

In general, there is a need to design flexible, light-weight database abstractions (such as TinyDB [18]) that are optimized for data movement as opposed to data storage.

5.3 Text Search

None of the current text search engines use DBMS technology for storage, even though they deal with massive, ever-increasing data sets. For instance, Google built its own storage system (called GFS [15]) that outperforms conventional DBMS technology (as well as file system technology) for some of the reasons discussed in Section 4.

A typical search engine workload [12, 15] consists of a combination of inbound streaming data (coming from web crawlers), which needs to be cleaned and incorporated into the existing search index, and ad hoc look-up operations on the existing index. In particular, the write operations are mostly append-only and the read operations are sequential. Concurrent writes (i.e., appends) to the same file are necessary for good performance. Finally, the large number of storage machines, made up of commodity parts, ensures that failure is the norm rather than the exception. Hence, high availability is a key design consideration and can only be achieved through fast recovery and replication.

Clearly, these application characteristics are much different from those of conventional business-processing applications. As a result, even though some DBMSs have built-in text search capabilities, they fall short of meeting the performance and availability requirements of this domain: they are simply too heavyweight and inflexible.

5.4  Scientific Databases

Massive amounts of data are continuously being gathered from the real-world by sensors of various types, attached to devices such as satellites and microscopes, or are generated artificially by high-resolution scientific and engineering simulations.

The analysis of such data sets is the key to better understanding physical phenomena and is becoming increasingly commonplace in many scientific research domains. Efficient analysis and querying of these vast databases require highly efficient multi-dimensional indexing structures and application-specific aggregation techniques. In addition, the need for efficient data archiving, staging, lineage, and error propagation techniques may create a need for yet another specialized engine in this important domain.

5.5  XML Databases

Semi-structured data is everywhere. Unfortunately, such data does not immediately fit into the relational model. There is a heated ongoing debate regarding how to best store and manipulate XML data. Even though some believe that relational DBMSs (with proper extensions) are the way to go, others would argue that a specialized engine is needed to store and process this data format.

6  A Comment on Factoring

Most stream-based applications require three basic services:

•  Message transport: In many stream applications, there is a need to transport data efficiently and reliably among multiple distributed machines. The reasons for this are threefold. First, data sources and destinations are typically geographically dispersed. Second, high performance and availability requirements dictate the use of multiple cooperating server machines. Third, virtually all big enterprise systems consist of a complicated network of business applications running on a large number of machines, in which an SPE is embedded. Thus, the input and output messages of the SPE need to be properly routed to and from the appropriate external applications.

•  Storage of state: As discussed in Section 4.3, in all but the most simplistic applications, there is a need to store state, typically in the form of read-only reference and historical tables, and read-write translation (e.g., hash) tables.

•  Execution of application logic: Many streaming applications demand domain-specific message processing to be interspersed with query activity. In general, it is neither possible nor practical to represent such application logic using only the built-in query primitives (e.g., think legacy code).

A traditional design for a stream-processing application spreads the entire application logic across three diverse systems: (1) a messaging system (such as MQSeries, WebMethods, or Tibco) to reliably connect the component systems, typically using a publish/subscribe paradigm; (2) a DBMS (such as DB2 or Oracle) to provide persistence for state information; and (3) an application server (such as WebSphere or WebLogic) to provide application services to a set of custom-coded programs. Such a three-tier configuration is illustrated in Figure 7.

Unfortunately, such a design that spreads required functionality over three heavyweight pieces of system software will not perform well. For example, every message that requires state lookup and application services will entail multiple process switches between these different services.

In order to illustrate this per-message overhead, we trace the steps taken when processing a message. An incoming message is first picked up by the bus and then forwarded to the custom application code (step 1), which cleans up and then processes the message. If the message needs to be correlated with historical data or requires access to persistent data, then a request is sent to the DB server (steps 2-3), which accesses the DBMS. The response follows the reverse path to the application code (steps 4-5). Finally, the outcome of the processed message is forwarded to the client task GUI (step 6). Overall, there are six "boundary crossings" for processing a single message. In addition to the obvious context switches incurred, messages also need to be transformed on the fly, by the appropriate adapters, to and from the native formats of the systems, each time they are picked up from and passed on to the message bus. The result is a very low ratio of useful work to overhead. Even if there is some batching of messages, the overhead will be high and limit achievable performance.

Figure 7  A multi-tier stream processing architecture

To avoid such a performance hit, a stream processing engine must provide all three services in a single piece of system software that executes as one multithreaded process on each machine that it runs. Hence, an SPE must have elements of a DBMS, an application server, and a messaging system. In effect, an SPE should provide specialized capabilities from all three kinds of software “under one roof”.

This observation raises the question of whether the current factoring of system software into components (e.g., application server, DBMS, Extract-Transform-Load system, message bus, file system, web server, etc.) is actually an optimal one. After all, this particular decomposition arose partly as a historical artifact and partly from marketing happenstance. Other factorings of system services seem equally plausible, and it should not be surprising to see considerable evolution of component definitions and factorings in the future.

7 Concluding Remarks

In summary, there may be a substantial number of domain-specific database engines with differing capabilities off into the future. We are reminded of the curse “may you live in interesting times”. We believe that the DBMS market is entering a period of very interesting times. There are a variety of existing and newly-emerging applications that can benefit from data management and processing principles and techniques. At the same time, these applications are very much different from business data processing and from each other—there seems to be no obvious way to support them with a single code line. The “one size fits all” theme is unlikely to successfully continue under these circumstances.

References

  [1]  Addamark Scalable Log Server. http://www.addamark.com/products/sls.htm.

  [2]  Kx systems. http://www.kx.com/.

  [3]  Lojack.com, 2004. http://www.lojack.com/.

  [4]  Sleepycat software. http://www.sleepycat.com/.

  [5]  StreamBase Inc. http://www.streambase.com/.

  [6]  Sybase IQ. http://www.sybase.com/products/databaseservers/sybaseiq.

  [7]  D. Abadi, D. Carney, U. Cetintemel, M. Cherniack, C. Convey, C. Erwin, E. Galvez, M. Hatoun, J. Hwang, A. Maskey, A. Rasin, A. Singer, M. Stonebraker, N. Tatbul, Y. Xing, R. Yan, and S. Zdonik. Aurora: A Data Stream Management System (demo description). In Proceedings of the 2003 ACM SIGMOD Conference on Management of Data, San Diego, CA, 2003.

  [8]  D. Abadi, D. Carney, U. Cetintemel, M. Cherniack, C. Convey, S. Lee, M. Stonebraker, N. Tatbul, and S. Zdonik. Aurora: A New Model and Architecture for Data Stream Management. VLDB Journal, 2003.

  [9]  A. Arasu, M. Cherniack, E. Galvez, D. Maier, A. Maskey, E. Ryvkina, M. Stonebraker, and R. Tibbetts. Linear Road: A Benchmark for Stream Data Management Systems. In Proceedings of the 30th International Conference on Very Large Data Bases (VLDB), Toronto, Canada, 2004.

[10]  M. M. Astrahan, M. W. Blasgen, D. D. Chamberlin, K. P. Eswaran, J. N. Gray, P. P. Griffiths, W. F. King, R. A. Lorie, P. R. McJones, J. W. Mehl, G. R. Putzolu, I. L. Traiger, B. Wade, and V. Watson. System R: A Relational Approach to Database Management. ACM Transactions on Database Systems, 1976.

[11]  J. Bartlett, J. Gray, and B. Horst. Fault tolerance in Tandem computer systems. Tandem Computers Technical Report 86.2, 1986.

[12]  E. Brewer, “Combining systems and databases: a search engine retrospective,” in Readings in Database Systems, M. Stonebraker and J. Hellerstein, Eds., 4 ed, 2004.

[13]  D. Carney, U. Cetintemel, M. Cherniack, C. Convey, S. Lee, G. Seidman, M. Stonebraker, N. Tatbul, and S. Zdonik. Monitoring Streams: A New Class of Data Management Applications. In proceedings of the 28th International Conference on Very Large Data Bases (VLDB’02), Hong Kong, China, 2002.

[14]  S. Chandrasekaran, O. Cooper, A. Deshpande, M. J. Franklin, J. M. Hellerstein, W. Hong, S. Krishnamurthy, S. R. Madden, V. Raman, F. Reiss, and M. A. Shah. TelegraphCQ: Continuous Dataflow Processing for an Uncertain World. In Proc. of the 1st CIDR Conference, Asilomar, CA, 2003.

[15]  S. Ghemawat, H. Gobioff, and S.-T. Leung. The Google file system. In Proceedings of the nineteenth ACM symposium on Operating systems principles (SOSP), Bolton Landing, NY, USA, 2003.

[16]  T. He, S. Krishnamurthy, J. A. Stankovic, T. Abdelzaher, L. Luo, R. Stoleru, T. Yan, L. Gu, J. Hui, and B. Krogh. An Energy-Efficient Surveillance System Using Wireless Sensor Networks. In MobiSys’04, 2004.

[17]  J.-H. Hwang, M. Balazinska, A. Rasin, U. Cetintemel, M. Stonebraker, and S. Zdonik. High-Availability Algorithms for Distributed Stream Processing. In Proceedings of the International Conference on Data Engineering, Tokyo, Japan, 2004.

[18]  S. Madden, M. Franklin, J. Hellerstein, and W. Hong. The Design of an Acquisitional Query Processor for Sensor Networks. In Proceedings of SIGMOD, San Diego, CA, 2003.

[19]  D. Malan, T. Fulford-Jones, M. Welsh, and S. Moulton. CodeBlue: An Ad Hoc Sensor Network Infrastructure for Emergency Medical Care. In WAMES’04, 2004.

[20]  R. Motwani, J. Widom, A. Arasu, B. Babcock, S. Babu, M. Datar, G. Manku, C. Olston, J. Rosenstein, and R. Varma. Query Processing, Resource Management, and Approximation in a Data Stream Management System. In Proc. of the First Biennial Conference on Innovative Data Systems Research (CIDR 2003), Asilomar, CA, 2003.

[21]  G. Pottie and W. Kaiser. Wireless Integrated Network Sensors. Communications of the ACM.

[22]  K. Rothermel and C. Mohan. ARIES/NT: A Recovery Method Based on Write-Ahead Logging for Nested Transactions. In Proc. 15th International Conference on Very Large Data Bases (VLDB), Amsterdam, Holland, 1989.

[23]  L. A. Rowe and K. A. Shoens. Data abstraction, views and updates in RIGEL. In Proceedings of the 1979 ACM SIGMOD international conference on Management of data (SIGMOD), Boston, Massachusetts, 1979.

[24]  P. Saffo. Sensors: The Next Wave of Information. Communications of the ACM.

[25]  J. W. Schmidt. Some High-Level Language Constructs for Data of Type Relation. ACM Transactions on Database Systems, 2(3):247–261, 1977.

[26]  L. Schwiebert, S. Gupta, and J. Weinmann. Research Challenges in Wireless Networks of Biomedical Sensors. In Mobicom’01, 2001.

[27]  M. Stonebraker, E. Wong, P. Kreps, and G. Held. The Design and Implementation of INGRES. ACM Trans. Database Systems, 1(3):189-222, 1976.

[28]  R. Szewczyk, J. Polastre, A. Mainwaring, and D. Culler. Lessons from a Sensor Network Expedition. In EWSN’04, 2004.

[29]  T. Liu, C. Sadler, P. Zhang, and M. Martonosi. Implementing Software on Resource-Constrained Mobile Sensors: Experiences with Impala and ZebraNet. In MobiSys’04, 2004.

Originally published in Proceedings of the 21st International Conference on Data Engineering, pp. 2–11, 2005. Original DOI: 10.1109/ICDE.2005.1

The End of an Architectural Era (It’s Time for a Complete Rewrite)

Michael Stonebraker (MIT CSAIL), Samuel Madden (MIT CSAIL), Daniel J. Abadi (MIT CSAIL), Stavros Harizopoulos (MIT CSAIL), Nabil Hachem (AvantGarde Consulting, LLC), Pat Helland (Microsoft Corporation)

Abstract

In previous papers [SC05, SBC+07], some of us predicted the end of “one size fits all” as a commercial relational DBMS paradigm. These papers presented reasons and experimental evidence that showed that the major RDBMS vendors can be outperformed by 1–2 orders of magnitude by specialized engines in the data warehouse, stream processing, text, and scientific database markets.

Assuming that specialized engines dominate these markets over time, the current relational DBMS code lines will be left with the business data processing (OLTP) market and hybrid markets where more than one kind of capability is required. In this paper we show that current RDBMSs can be beaten by nearly two orders of magnitude in the OLTP market as well. The experimental evidence comes from comparing a new OLTP prototype, H-Store, which we have built at M.I.T. to a popular RDBMS on the standard transactional benchmark, TPC-C.

We conclude that the current RDBMS code lines, while attempting to be a “one size fits all” solution, in fact, excel at nothing. Hence, they are 25-year-old legacy code lines that should be retired in favor of a collection of “from scratch” specialized engines. The DBMS vendors (and the research community) should start with a clean sheet of paper and design systems for tomorrow’s needs.

1  Introduction

The popular relational DBMSs all trace their roots to System R from the 1970s. For example, DB2 is a direct descendent of System R, having used the RDS portion of System R intact in their first release. Similarly, SQL Server is a direct descendent of Sybase System 5, which borrowed heavily from System R. Lastly, the first release of Oracle implemented the user interface from System R.

All three systems were architected more than 25 years ago, when hardware characteristics were much different than today. Processors are thousands of times faster and memories are thousands of times larger. Disk volumes have increased enormously, making it possible to keep essentially everything, if one chooses to. However, the bandwidth between disk and main memory has increased much more slowly. One would expect this relentless pace of technology to have changed the architecture of database systems dramatically over the last quarter of a century, but surprisingly the architecture of most DBMSs is essentially identical to that of System R.

Moreover, at the time relational DBMSs were conceived, there was only a single DBMS market, business data processing. In the last 25 years, a number of other markets have evolved, including data warehouses, text management, and stream processing. These markets have very different requirements than business data processing.

Lastly, the main user interface device at the time RDBMSs were architected was the dumb terminal, and vendors imagined operators inputting queries through an interactive terminal prompt. Now it is a powerful personal computer connected to the World Wide Web. Web sites that use OLTP DBMSs rarely run interactive transactions or present users with direct SQL interfaces.

In summary, the current RDBMSs were architected for the business data processing market in a time of different user interfaces and different hardware characteristics. Hence, they all include the following System R architectural features:

•  Disk oriented storage and indexing structures

•  Multithreading to hide latency

•  Locking-based concurrency control mechanisms

•  Log-based recovery

Of course, there have been some extensions over the years, including support for compression, shared-disk architectures, bitmap indexes, support for user-defined data types and operators, etc. However, no system has had a complete redesign since its inception. This paper argues that the time has come for a complete rewrite.

A previous paper [SBC+07] presented benchmarking evidence that the major RDBMSs could be beaten by specialized architectures by an order of magnitude or more in several application areas, including:

•  Text (specialized engines from Google, Yahoo, etc.)

•  Data Warehouses (column stores such as Vertica, Monet [Bon02], etc.)

•  Stream Processing (stream processing engines such as StreamBase and Coral8)

•  Scientific and intelligence databases (array storage engines such as MATLAB and ASAP [SBC+07])

Based on this evidence, one is led to the following conclusions:

1.  RDBMSs were designed for the business data processing market, which is their sweet spot

2.  They can be beaten handily in most any other market of significant enough size to warrant the investment in a specialized engine

This paper builds on [SBC+07] by presenting evidence that the current architecture of RDBMSs is not even appropriate for business data processing. Our methodology is similar to the one employed in [SBC+07]. Specifically, we have designed a new DBMS engine for OLTP applications. Enough of this engine, H-Store, is running to enable us to conduct a performance bakeoff between it and a popular commercial RDBMS. Our experimental data shows H-Store to be a factor of 82 faster on TPC-C (almost two orders of magnitude).

Because RDBMSs can be beaten by more than an order of magnitude on the standard OLTP benchmark, there is no market where they are competitive. As such, they should be considered legacy technology more than a quarter of a century in age, for which a complete redesign and re-architecting is the appropriate next step.

Section 2 of this paper explains the design considerations that can be exploited to achieve this factor of 82 on TPC-C. Then, in Section 3, we present specific application characteristics which can be leveraged by a specialized engine. Following that, we sketch some of the H-Store design in Section 4. We then proceed in Section 5 to present experimental data on H-Store and a popular RDBMS on TPC-C. We conclude the paper in Section 6 with some radical suggestions for the research agenda for the DBMS community.

2  OLTP Design Considerations

This section presents five major issues, which a new engine such as H-Store can leverage to achieve dramatically better performance than current RDBMSs.

2.1  Main Memory

In the late 1970’s a large machine had somewhere around a megabyte of main memory. Today, several Gbytes are common and large machines are approaching 100 Gbytes. In a few years a terabyte of main memory will not be unusual. Imagine a shared-nothing grid system of 20 nodes, each with 32 Gbytes of main memory now (soon to be 100 Gbytes), and costing less than $50,000. As such, any database less than a terabyte in size is capable of main memory deployment now or in the near future.

The overwhelming majority of OLTP databases are less than 1 Tbyte in size and growing in size quite slowly. For example, it is a telling statement that TPC-C requires about 100 Mbytes per physical distribution center (warehouse). A very large retail enterprise might have 1000 warehouses, requiring around 100 Gbytes of storage, which fits our envelope for main memory deployment.
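
The sizing argument above can be checked with simple arithmetic. The grid shape (20 nodes of 32 Gbytes each) and the ~100 Mbytes-per-warehouse figure come from the text; the sketch below just multiplies them out.

```python
MB, GB = 10**6, 10**9

warehouse_bytes = 100 * MB             # TPC-C: ~100 MB per distribution center
warehouses = 1000                      # a very large retail enterprise
database_bytes = warehouses * warehouse_bytes

grid_nodes, node_memory = 20, 32 * GB  # the grid described in Section 2.1
grid_memory_bytes = grid_nodes * node_memory

print(database_bytes // GB)            # -> 100 (Gbytes of OLTP data)
print(grid_memory_bytes // GB)         # -> 640 (Gbytes of aggregate main memory)
print(database_bytes <= grid_memory_bytes)  # -> True: the database fits in RAM
```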

As such, we believe that OLTP should be considered a main memory market, if not now then within a very small number of years. Consequently, the current RDBMS vendors have disk-oriented solutions for a main memory problem. In summary, 30 years of Moore’s law has antiquated the disk-oriented relational architecture for OLTP applications.

Although there are some main memory database products on the market, such as TimesTen and SolidDB, these systems inherit the baggage of System R as well. This includes such features as a disk-based recovery log and dynamic locking, which, as we discuss in the following sections, impose substantial performance overheads.

2.2  Multi-threading and Resource Control

OLTP transactions are very lightweight. For example, the heaviest transaction in TPC-C reads about 200 records. In a main memory environment, the useful work of such a transaction consumes less than one millisecond on a low-end machine. In addition, most OLTP environments we are familiar with do not have “user stalls”. For example, when an Amazon user clicks “buy it”, he activates an OLTP transaction which will only report back to the user when it finishes. Because of an absence of disk operations and user stalls, the elapsed time of an OLTP transaction is minimal. In such a world it makes sense to run each SQL command in a transaction to completion with a single-threaded execution model, rather than paying for the overheads of isolation between concurrently executing statements.
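
As a rough illustration of this execution model (our sketch, not H-Store code), the engine below runs queued transactions to completion one at a time against in-memory tables, so no locks, latches, or resource governor are needed between them:

```python
from collections import deque

class SingleThreadedEngine:
    """Toy single-threaded, run-to-completion transaction executor."""

    def __init__(self):
        self.tables = {}       # all data lives in main memory
        self.queue = deque()   # transactions waiting to run

    def submit(self, txn):
        """txn is a callable taking the in-memory tables and returning a result."""
        self.queue.append(txn)

    def run(self):
        results = []
        while self.queue:
            txn = self.queue.popleft()
            # Run to completion: no other transaction touches self.tables
            # meanwhile, so the data structures need no latches or locks.
            results.append(txn(self.tables))
        return results

# Usage: two short OLTP-style transactions executed back to back.
engine = SingleThreadedEngine()
engine.submit(lambda db: db.setdefault("accounts", {}).update({1: 100}))
engine.submit(lambda db: db["accounts"][1])
print(engine.run()[-1])   # -> 100
```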

Current RDBMSs have elaborate multi-threading systems to try to fully utilize CPU and disk resources. This allows several-to-many queries to be running in parallel. Moreover, they also have resource governors to limit the multiprogramming load, so that other resources (IP connections, file handles, main memory for sorting, etc.) do not become exhausted. These features are irrelevant in a single threaded execution model. No resource governor is required in a single threaded system.

In a single-threaded execution model, there is also no reason to have multithreaded data structures. Hence the elaborate code required to support, for example, concurrent B-trees can be completely removed. This results in a more reliable system, and one with higher performance.

At this point, one might ask “What about long running commands?” In real-world OLTP systems, there aren’t any for two reasons: First, operations that appear to involve long-running transactions, such as a user inputting data for a purchase on a web store, are usually split into several transactions to keep transaction time short. In other words, good application design will keep OLTP queries small. Second, longer-running ad-hoc queries are not processed by the OLTP system; instead such queries are directed to a data warehouse system, optimized for this activity. There is no reason for an OLTP system to solve a non-OLTP problem. Such thinking only applies in a “one size fits all” world.

2.3 Grid Computing and Fork-lift Upgrades

Current RDBMSs were originally written for the prevalent architecture of the 1970s, namely shared-memory multiprocessors. In the 1980’s shared disk architectures were spearheaded by Sun and HP, and most DBMSs were expanded to include capabilities for this architecture. It is obvious that the next decade will bring domination by shared-nothing computer systems, often called grid computing or blade computing. Hence, any DBMS must be optimized for this configuration. An obvious strategy is to horizontally partition data over the nodes of a grid, a tactic first investigated in Gamma [DGS+90].
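
A minimal sketch of that partitioning strategy (our illustration; the node count and sample keys are hypothetical) hashes each row's partitioning key to one grid node:

```python
import zlib

def node_for(key: str, num_nodes: int) -> int:
    # A stable checksum (unlike Python's per-process randomized hash())
    # so that placement is deterministic across machines and restarts.
    return zlib.crc32(key.encode()) % num_nodes

NUM_NODES = 4   # hypothetical grid size
partitions = {n: [] for n in range(NUM_NODES)}
for warehouse_id in ("W1", "W2", "W3", "W4", "W5", "W6"):
    partitions[node_for(warehouse_id, NUM_NODES)].append(warehouse_id)

print(partitions)  # each warehouse's rows live on exactly one node
```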

In addition, no user wants to perform a “fork-lift” upgrade. Hence, any new system should be architected for incremental expansion. If N grid nodes do not provide enough horsepower, then one should be able to add another K nodes, producing a system with N+K nodes. Moreover, one should perform this upgrade, without a hiccup, i.e. without taking the DBMS down. This will eliminate every system administrator’s worst nightmare; a fork-lift upgrade with a requirement for a complete data reload and cutover.

To achieve incremental upgrade without going down requires significant capabilities not found in existing systems. For example, one must be able to copy portions of a database from one site to another without stopping transactions. It is not clear how to bolt such a capability onto most existing systems. However, this can be made a requirement of a new design and implemented efficiently, as has been demonstrated by the existence of exactly this feature in the Vertica codeline.

2.4  High Availability

Relational DBMSs were designed in an era (1970s) when an organization had a single machine. If it went down, then the company lost money due to system unavailability. To deal with disasters, organizations typically sent log tapes off site. If a disaster occurred, then the hardware vendor (typically IBM) would perform heroics to get new hardware delivered and operational in small numbers of days. Running the log tapes then brought the system back to something approaching where it was when the disaster happened.

A decade later in the 1980’s, organizations executed contracts with disaster recovery services, such as Comdisco, for backup machine resources, so the log tapes could be installed quickly on remote backup hardware. This strategy minimized the time that an enterprise was down as a result of a disaster.

Today, there are numerous organizations that run a hot standby within the enterprise, so that real-time failover can be accomplished. Alternately, some companies run multiple primary sites, so failover is even quicker. The point to be made is that businesses are much more willing to pay for multiple systems in order to avoid the crushing financial consequences of down time, often estimated at thousands of dollars per minute.

In the future, we see high availability and built-in disaster recovery as essential features in the OLTP (and other) markets. There are a few obvious conclusions to be drawn from this statement. First, every OLTP DBMS will need to keep multiple replicas consistent, requiring the ability to run seamlessly on a grid of geographically dispersed systems.

Second, most existing RDBMS vendors have glued multi-machine support onto the top of their original SMP architectures. In contrast, it is clearly more efficient to start with shared-nothing support at the bottom of the system.

Third, the best way to support shared nothing is to use multiple machines in a peer-to-peer configuration. In this way, the OLTP load can be dispersed across multiple machines, and inter-machine replication can be utilized for fault tolerance. That way, all machine resources are available during normal operation. Failures only cause degraded operation with fewer resources. In contrast, many commercial systems implement a “hot standby”, whereby a second machine sits effectively idle waiting to take over if the first one fails. In this case, normal operation has only half of the resources available, an obviously worse solution. These points argue for a complete redesign of RDBMS engines so they can implement peer-to-peer HA in the guts of a new architecture.
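
The resource argument can be made concrete with toy numbers (ours, purely illustrative):

```python
# Compare usable capacity during normal operation for the two HA
# configurations discussed above. The per-node throughput is hypothetical.

def usable_capacity(active_nodes: int, per_node_tps: int) -> int:
    """Transactions per second available while no failure is in progress."""
    return active_nodes * per_node_tps

PER_NODE_TPS = 10_000                           # assumed per-machine throughput

hot_standby = usable_capacity(1, PER_NODE_TPS)  # of 2 machines, one sits idle
peer_to_peer = usable_capacity(2, PER_NODE_TPS) # both replicas serve load

print(hot_standby, peer_to_peer)  # -> 10000 20000
```

Under these assumptions the peer-to-peer pair delivers twice the normal-operation throughput of the hot-standby pair, which is the "half of the resources" point made above.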

In an HA system, regardless of whether it is hot-standby or peer-to-peer, logging can be dramatically simplified. One must continue to have an undo log, in case a transaction fails and needs to roll back. However, the undo log does not have to persist beyond the completion of the transaction. As such, it can be a main memory data structure that is discarded on transaction commit. There is never a need for redo, because that will be accomplished via network recovery from a remote site. When the dead site resumes activity, it can be refreshed from the data on an operational site.
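
A sketch of such a transient undo log (our illustration, not code from any of the systems discussed): before-images live in main memory for the duration of the transaction and are simply discarded on commit.

```python
class MemoryUndoTxn:
    """Toy transaction with a main-memory undo log and no redo log."""

    def __init__(self, store: dict):
        self.store = store
        self.undo = []                 # list of (key, before-image) pairs

    def write(self, key, value):
        # Record the before-image, then update the store in place.
        self.undo.append((key, self.store.get(key)))
        self.store[key] = value

    def commit(self):
        self.undo.clear()              # undo information is simply dropped

    def rollback(self):
        for key, before in reversed(self.undo):
            if before is None:
                self.store.pop(key, None)   # key did not exist before
            else:
                self.store[key] = before
        self.undo.clear()

# Usage: an aborted update leaves the store untouched.
store = {"balance": 100}
txn = MemoryUndoTxn(store)
txn.write("balance", 40)
txn.rollback()
print(store["balance"])   # -> 100
```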

A recent paper [LM06] argues that failover/rebuild is as efficient as redo log processing. Hence, there is essentially no downside to operating in this manner. In an HA world, one is led to having no persistent redo log, just a transient undo one. This dramatically simplifies recovery logic. It moves from an Aries-style [MHL+92] logging system to new functionality to bring failed sites up to date from operational sites when they resume operation.

Again, a large amount of complex code has been made obsolete, and a different capability is required.

2.5  No Knobs

Current systems were built in an era where resources were incredibly expensive, and every computing system was watched over by a collection of wizards in white lab coats, responsible for the care, feeding, tuning and optimization of the system. In that era, computers were expensive and people were cheap. Today we have the reverse. Personnel costs are the dominant expense in an IT shop.

As such “self-everything” (self-healing, self-maintaining, self-tuning, etc.) systems are the only answer. However, all RDBMSs have a vast array of complex tuning knobs, which are legacy features from a bygone era. True; all vendors are trying to provide automatic facilities which will set these knobs without human intervention. However, legacy code cannot ever remove features. Hence, “no knobs” operation will be in addition to “human knobs” operation, and result in even more system documentation. Moreover, at the current time, the automatic tuning aids in the RDBMSs that we are familiar with do not produce systems with anywhere near the performance that a skilled DBA can produce. Until the tuning aids get vastly better in current systems, DBAs will turn the knobs.

A much better answer is to completely rethink the tuning process and produce a new system with no visible knobs.

3  Transaction, Processing and Environment Assumptions

If one assumes a grid of systems with main memory storage, built-in high availability, no user stalls, and useful transaction work under 1 millisecond, then the following conclusions become evident:

1.  A persistent redo log is almost guaranteed to be a significant performance bottleneck. Even with group commit, forced writes of commit records can add milliseconds to the runtime of each transaction. The HA/failover system discussed earlier dispenses with this expensive architectural feature.

2.  With redo gone, getting transactions into and out of the system is likely to be the next significant bottleneck. The overhead of JDBC/ODBC style interfaces will be onerous, and something more efficient should be used. In particular, we advocate running application logic—in the form of stored procedures—“in process” inside the database system, rather than incurring the interprocess overheads implied by the traditional database client/server model.

3.  An undo log should be eliminated wherever practical, since it will also be a significant bottleneck.

4.  Every effort should be made to eliminate the cost of traditional dynamic locking for concurrency control, which will also be a bottleneck.

5.  The latching associated with multi-threaded data structures is likely to be onerous. Given the short runtime of transactions, moving to a single threaded execution model will eliminate this overhead at little loss in performance.

6.  One should avoid a two-phase commit protocol for distributed transactions, wherever possible, as network latencies imposed by round trip communications in 2PC often take on the order of milliseconds.

Our ability to remove concurrency control, commit processing and undo logging depends on several characteristics of OLTP schemas and transaction workloads, a topic to which we now turn.

3.1  Transaction and Schema Characteristics

H-Store requires the complete workload to be specified in advance, consisting of a collection of transaction classes. Each class contains transactions with the same SQL statements and program logic, differing in the run-time constants used by individual transactions. Since there are assumed to be no ad-hoc transactions in an OLTP system, this does not appear to be an unreasonable requirement. Such transaction classes must be registered with H-Store in advance, and will be disallowed if they contain user stalls (transactions may contain stalls for other reasons—for example, in a distributed setting where one machine must wait for another to process a request.) Similarly, H-Store also assumes that the collection of tables (logical schema) over which the transactions operate is known in advance.

We have observed that in many OLTP workloads every table, except a single one called the root, has exactly one join term, which is a 1-n relationship with its ancestor. Hence, the schema is a tree of 1-n relationships. We denote this class of schemas as tree schemas. Such schemas are popular; for example, customers produce orders, which have line items and fulfillment schedules. Tree schemas have an obvious horizontal partitioning over the nodes in a grid. Specifically, the root table can be range- or hash-partitioned on the primary key(s). Every descendant table can be partitioned such that all equi-joins in the tree span only a single site. In the discussion to follow, we will consider both tree and non-tree schemas.
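
To make the partitioning concrete, the following sketch hash-partitions a hypothetical customer/orders/line-item tree on the root's primary key. The table names, the two-site grid, and the use of Python's built-in hash are illustrative assumptions, not part of the H-Store design.

```python
# Sketch of horizontal partitioning for a tree schema (hypothetical tables).
# The root table (customer) is hash-partitioned on its primary key; each
# descendant row is assigned to the site of the root tuple it descends from,
# so every equi-join along the tree is local to a single site.

N_SITES = 2  # illustrative grid size

def site_of_root(customer_id):
    # The root table may be range- or hash-partitioned; hash is used here.
    return hash(customer_id) % N_SITES

def site_of_order(order):
    # An order descends from its customer, so it is co-located with it.
    return site_of_root(order["customer_id"])

def site_of_line_item(line_item, orders):
    # A line item descends from its order; follow the chain to the root.
    return site_of_order(orders[line_item["order_id"]])
```

With this placement, a transaction rooted at a given customer (e.g., one whose commands all carry a predicate like customer_id = 27) touches exactly one site.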

In a tree schema, suppose every command in every transaction class has equality predicates on the primary key(s) of the root node (for example, in an e-commerce application, many commands will be rooted with a specific customer, so will include predicates like customer_id = 27). Using the horizontal partitioning discussed above, it is clear that in this case every SQL command in every transaction is local to one site. If, in addition, every command in each transaction class is limited to the same single site, then we call the application a constrained tree application (CTA). A CTA application has the valuable feature that every transaction can be run to completion at a single site. The value of such single-sited transactions, as will be discussed in Section 4.3, is that transactions can execute without any stalls for communication with another grid site (however, in some cases, replicas will have to synchronize so that transactions are executed in the same order).

If every command in every transaction of a CTA specifies an equality match on the primary key(s) of one or more direct descendent nodes in addition to the equality predicate on the root, then the partitioning of a tree schema can be extended hierarchically to include these direct descendent nodes. In this case, a finer granularity partitioning can be used, if desired.

CTAs are an important class of single-sited applications which can be executed very efficiently. Our experience with many years of designing database applications in major corporations suggests that OLTP applications are often designed explicitly to be CTAs, or that decompositions to CTAs are often possible [Hel07]. Besides simply arguing that CTAs are prevalent, we are also interested in techniques that can be used to make non-CTA applications single-sited; it is an interesting research problem to precisely characterize the situations in which this is possible. We mention two possible schema transformations that can be systematically applied here.

First, consider all of the read-only tables in the schema, i.e. ones which are not updated by any transaction class. These tables can be replicated at all sites. If the application becomes CTA with these tables removed from consideration, then the application becomes single-sited after replication of the read-only tables.

Another important class of applications are one-shot. These applications have the property that all of their transactions can be executed in parallel without requiring intermediate results to be communicated among sites. Moreover, the result of previous SQL queries are never required in subsequent commands. In this case, each transaction can be decomposed into a collection of single-site plans which can be dispatched to the appropriate sites for execution.

Applications can often be made one-shot with vertical partitioning of tables amongst sites (columns that are not updated are replicated); this is true of TPC-C, for example (as we discuss in Section 5.)

Some transaction classes are two-phase (or can be made to be two-phase). In phase one there is a collection of read-only operations. Based on the results of these queries, the transaction may be aborted. Phase two then consists of a collection of queries and updates where there can be no possibility of an integrity violation. H-Store will exploit the two-phase property to eliminate the undo log. We have observed that many transactions, including those in TPC-C, are two-phase.

A transaction class is strongly two-phase if it is two-phase and additionally has the property that phase 1 operations on all replicas result in all replica sites aborting or all continuing.

Additionally, for every transaction class, we find all other classes whose members commute with members of the indicated class. Our specific definition of commutativity is:

Two concurrent transactions from the same or different classes commute when any interleaving of their single-site sub-plans produces the same final database state as any other interleaving (assuming both transactions commit).
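
For small sub-plan collections, this definition can be checked directly by enumeration. The sketch below models single-site sub-plans as hypothetical state-transforming functions and tests whether every interleaving yields the same final state; it illustrates the definition and is not H-Store's actual machinery.

```python
# Sketch: two transactions commute when every interleaving of their
# single-site sub-plans produces the same final database state.
# Sub-plans are modeled as functions from state to state.

def interleavings(a, b):
    # All merges of two sub-plan sequences that preserve each plan's order.
    if not a:
        return [list(b)]
    if not b:
        return [list(a)]
    return [[a[0]] + rest for rest in interleavings(a[1:], b)] + \
           [[b[0]] + rest for rest in interleavings(a, b[1:])]

def commute(plan_a, plan_b, initial_state):
    # Run every interleaving from the same initial state; the plans
    # commute iff all runs end in the same final state.
    finals = set()
    for order in interleavings(plan_a, plan_b):
        state = dict(initial_state)
        for step in order:
            state = step(state)
        finals.add(frozenset(state.items()))
    return len(finals) == 1

# Example sub-plans (hypothetical): increments commute; a blind write does not.
inc_x = lambda s: {**s, "x": s["x"] + 1}
set_x = lambda s: {**s, "x": 100}
```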

A transaction class which commutes with all transaction classes (including itself) will be termed sterile.

We use single-sited, sterile, two-phase, and strong two-phase properties in the H-Store algorithms, which follow. We have identified these properties as being particularly relevant based on our experience with major commercial online retail applications, and are confident that they will be found in many real world environments.

4  H-Store Sketch

In this section, we describe how H-Store exploits the previously described properties to implement a very efficient OLTP database.

4.1  System Architecture

H-Store runs on a grid of computers. All objects are partitioned over the nodes of the grid. Like C-Store [SAB+05], the user can specify the level of K-safety that he wishes to have.

At each site in the grid, rows of tables are placed contiguously in main memory, with conventional B-tree indexing. B-tree block size is tuned to the width of an L2 cache line on the machine being used. Although conventional B-trees can be beaten by cache conscious variations [RR99, RR00], we feel that this is an optimization to be performed only if indexing code ends up being a significant performance bottleneck.

Every H-Store site is single threaded, and performs incoming SQL commands to completion, without interruption. Each site is decomposed into a number of logical sites, one for each available core. Each logical site is considered an independent physical site, with its own indexes and tuple storage. Main memory on the physical site is partitioned among the logical sites. In this way, every logical site has a dedicated CPU and is single threaded.

In an OLTP environment most applications use stored procedures to cut down on the number of round trips between an application and the DBMS. Hence, H-Store has only one DBMS capability, namely to execute a predefined transaction (transactions may be issued from any site):

Execute transaction (parameter_list)

In the current prototype, stored procedures are written in C++, though we have suggestions on better languages in Section 6. Our implementation mixes application logic with direct manipulation of the database in the same process; this provides comparable performance to running the whole application inside a single stored procedure, where SQL calls are made as local procedure calls (not JDBC) and data is returned in a shared data array (again not JDBC). Like C-Store there is no redo log, and an undo log is written only if required, as discussed in Section 4.4. If written, the undo log is main memory resident, and discarded on transaction commit.
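
A minimal sketch of such a main-memory undo log follows. The class and method names are hypothetical, and a real implementation would record before-images of tuples and index entries rather than entries in a flat key-value map; the sketch assumes stored values are never None.

```python
# Sketch of a main-memory undo log (hypothetical API). Before-images are
# recorded only for updated keys; on commit the log is simply discarded
# (there is no redo log), and on abort it is replayed in reverse to
# restore the old values.

class Transaction:
    def __init__(self, db):
        self.db = db
        self.undo = []  # list of (key, before_image); None marks a new key

    def write(self, key, value):
        self.undo.append((key, self.db.get(key)))
        self.db[key] = value

    def commit(self):
        self.undo.clear()  # drop the undo records; nothing is persisted

    def abort(self):
        for key, before in reversed(self.undo):
            if before is None:
                self.db.pop(key, None)   # the key did not exist before
            else:
                self.db[key] = before
        self.undo.clear()
```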

4.2  Query Execution

We expect to build a conventional cost-based query optimizer which produces query plans for the SQL commands in transaction classes at transaction definition time. We believe that this optimizer can be rather simple, as 6-way joins are never done in OLTP environments. If multi-way joins occur, they invariably identify a unique tuple of interest (say a purchase order number) and then the tuples that join to this record (such as the line items). Hence, one invariably proceeds from an anchor tuple through a small number of 1-to-n joins to the tuples of ultimate interest. GROUP BY and aggregation rarely occur in OLTP environments. The net result is, of course, a simple query execution plan.

The query execution plans for all commands in a transaction may be:

Single-sited: In this case the collection of plans can be dispatched to the appropriate site for execution.

One shot: In this case, all transactions can be decomposed into a set of plans that are executed only at a single site.

General: In the general case, there will be commands which require intermediate results to be communicated among sites in the grid. In addition, there may be commands whose run-time parameters are obtained from previous commands. In this case, we need the standard Gamma-style run time model of an execution supervisor at the site where the transaction enters the system, communicating with workers at the sites where data resides.

For general transactions, we compute the depth of the transaction class as the number of times, in the collection of plans, that a message must be sent between sites.

4.3  Database Designer

To achieve no-knobs operation, H-Store will build an automatic physical database designer which will specify horizontal partitioning, replication locations, and indexed fields.

In contrast to C-Store which assumed a world of overlapping materialized views appropriate in a read-mostly environment, H-Store implements the tables specified by the user and uses standard replication of user-specified tables to achieve HA. Most tables will be horizontally partitioned across all of the nodes in a grid. To achieve HA, such table fragments must have one or more buddies, which contain exactly the same information, possibly stored using a different physical representation (e.g., sort order).

The goal of the database designer is to make as many transaction classes as possible single-sited. The strategy to be employed is similar to the one used by C-Store [SAB+05]. That system constructed automatic designs for the omnipresent star or snowflake schemas in warehouse environments, and is now in the process of generalizing these algorithms for schemas that are “near snowflakes”. Similarly, H-Store will construct automatic designs for the common case in OLTP environments (constrained tree applications), and will use the previously mentioned strategy of partitioning the database across sites based on the primary key of the root table and assigning tuples of other tables to sites based on root tuples they descend from. We will also explore extensions, such as optimizations for read-only tables and vertical partitioning mentioned in Section 3. It is a research task to see how far this approach can be pushed and how successful it will be.

In the meantime, horizontal partitioning and indexing options can be specified manually by a knowledgeable user.

4.4  Transaction Management, Replication and Recovery

Since H-Store implements two (or more) copies of each table, replicas must be transactionally updated. This is accomplished by directing each SQL read command to any replica and each SQL update to all replicas.

Moreover, every transaction receives a timestamp on entry to H-Store, which consists of a (site_id, local_unique_timestamp) pair. Given an ordering of sites, timestamps are unique and form a total order. We assume that the local clocks which generate local timestamps are kept nearly in sync with each other, using an algorithm like NTP [Mil89].
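
The timestamp scheme can be sketched as follows. The tie-breaking rule (compare local timestamps first, then site ids) is our assumption about how the total order is formed; the class names are hypothetical.

```python
import itertools

# Sketch: each site issues (site_id, local_unique_timestamp) pairs.
# Given an ordering of sites, breaking ties on local timestamps by
# site_id yields a total order across the grid.

class SiteClock:
    def __init__(self, site_id):
        self.site_id = site_id
        self.counter = itertools.count()  # locally unique, monotonic

    def next_timestamp(self):
        return (self.site_id, next(self.counter))

def earlier(ts_a, ts_b):
    # Compare local timestamps first, then site_id to break ties.
    (site_a, local_a), (site_b, local_b) = ts_a, ts_b
    return (local_a, site_a) < (local_b, site_b)
```

In practice the local counters would be derived from clocks kept nearly in sync by NTP, so timestamp order approximates real-time arrival order.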

There are multiple situations which H-Store leverages to streamline concurrency control and commit protocols.

Single-sited/one shot. If all transaction classes are single-sited or one-shot, then individual transactions can be dispatched to the correct replica sites and executed to completion there. Unless all transaction classes are sterile, each execution site must wait a small period of time (meant to account for network delays) for transactions arriving from other initiators, so that the execution is in timestamp order. By increasing latency by a small amount, all replicas will be updated in the same order; in a local area network, maximum delays will be sub-millisecond. This will guarantee the identical outcome at each replica. Hence, data inconsistency between the replicas cannot occur. Also, all replicas will commit or all replicas will abort. Hence, each transaction can commit or abort locally, confident that the same outcome will occur at the other replicas. There is no redo log, no concurrency control, and no distributed commit processing.

Two-phase. No undo-log is required. Thus, if combined with the above properties, no transaction facilities are required at all.

Sterile. If all transaction classes are sterile, then execution can proceed normally with no concurrency control. Further, the need to issue timestamps and execute transactions in the same order on all replicas is obviated. However, if multiple sites are involved in query processing, then there is no guarantee that all sites will abort or all sites will continue. In this case, workers must respond “abort” or “continue” at the end of the first phase, and the execution supervisor must communicate this information to the worker sites. Hence, standard distributed commit processing must be done at the end of phase one. This extra overhead can be avoided if the transaction is strongly two-phase.

Other cases. For other cases (non-sterile, non-single-sited, non one-shot), we need to endure the overhead of some sort of concurrency control scheme. All RDBMSs we are familiar with use dynamic locking to achieve transaction consistency. This decision followed pioneering simulation work in the 1980’s [ACL87] that showed that locking worked better than other alternatives. However, we believe that dynamic locking is a poor choice for H-Store for the following reasons:

1.  Transactions are very short-lived. There are no user-stalls and no disk activity. Hence, transactions are alive for very short time periods. This favors optimistic methods over pessimistic methods, like dynamic locking. Others, for example architects and programming language designers using transactions in memory models [HM93], have reached the same conclusion.

2.  Every transaction is decomposed into collections of sub-commands, which are local to a given site. As noted earlier, the collection of sub commands are run in a single threaded fashion at each site. Again, this results in no latch waits, smaller total execution times, and again favors more optimistic methods.

3.  We assume that we receive the entire collection of transaction classes in advance. This information can be used to advantage, as has been done previously by systems such as the SDD-1 scheme from the 1970’s [BSR80] to reduce the concurrency control overhead.

4.  In a well designed system there are very few transaction collisions and even fewer deadlocks. These situations degrade performance, and the workload is invariably modified by application designers to remove them. Hence, one should design for the “no collision” case, rather than using pessimistic methods.

The H-Store scheme takes advantage of these factors.

Every (non-sterile, non-single-sited, non-one-shot) transaction class has a collection of transaction classes with which it might conflict. Such a transaction arrives at some site in the grid and interacts with a transaction coordinator at that site. The transaction coordinator acts as the execution supervisor at the arrival site and sends out the subplan pieces to the various sites. A worker site receives a subplan and waits for the same small period of time mentioned above for other possibly conflicting transactions with lower timestamps to arrive. Then, the worker:

•  Executes the subplan, if there is no uncommitted, potentially conflicting transaction at his site with a lower timestamp, and then sends his output data to the site requiring it, which may be an intermediate site or the transaction coordinator.

•  Issues an abort to the coordinator otherwise

If the coordinator receives an “ok” from all sites, it continues with the transaction by issuing the next collection of subplans, perhaps with C++ logic interspersed. If there are no more subplans, then it commits the transaction. Otherwise, it aborts.
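
The worker's decision in this basic strategy can be sketched as below. The data structures (a pending-transaction list and a per-class conflict map) are hypothetical simplifications; timestamps are assumed to be totally ordered and comparable with `<`.

```python
# Sketch of the basic-strategy worker decision (data structures hypothetical).
# A worker executes a subplan only if no uncommitted, potentially conflicting
# transaction with a lower timestamp is pending at its site; otherwise it
# answers "abort" to the transaction coordinator.

def worker_decision(subplan_ts, subplan_class, pending, conflict_map):
    """pending: (timestamp, class) pairs of uncommitted txns at this site.
    conflict_map: class -> set of classes it might conflict with."""
    for ts, cls in pending:
        if ts < subplan_ts and cls in conflict_map.get(subplan_class, set()):
            return "abort"
    return "execute"
```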

The above algorithm is the basic H-Store strategy. During execution, a transaction monitor watches the percentage of successful transactions. If there are too many aborts, H-Store dynamically moves to the following more sophisticated strategy.

Before executing or aborting the subplan, noted above, each worker site stalls by a length of time approximated by MaxD * average_round_trip_message_delay to see if a subplan with an earlier timestamp appears. If so, the worker site correctly sequences the subplans, thereby lowering the probability of abort. MaxD is the maximum depth of a conflicting transaction class.

This intermediate strategy lowers the abort probability, but at a cost of some number of msecs of increased latency. We are currently running simulations to demonstrate the circumstances under which this results in improved performance.

Our last advanced strategy keeps track of the read set and write set of each transaction at each site. In this case, a worker site runs each subplan, and then aborts the subplan if necessary according to standard optimistic concurrency control rules. At some extra overhead in bookkeeping and additional work discarded on aborts, the probability of conflict can be further reduced. Again, simulations are in progress to determine when this is a winning strategy.
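
The validation step of this advanced strategy can be sketched with standard optimistic concurrency control rules; the function names and the flat read/write-set representation are assumptions for illustration.

```python
# Sketch of read-set/write-set validation (standard optimistic concurrency
# control; names hypothetical). Two transactions conflict if one's write
# set intersects the other's read set or write set.

def conflicts(txn_a, txn_b):
    reads_a, writes_a = txn_a
    reads_b, writes_b = txn_b
    return bool(writes_a & (reads_b | writes_b)) or bool(writes_b & reads_a)

def validate(txn, overlapping):
    # Abort the subplan if it conflicts with any overlapping transaction;
    # the discarded work is the extra cost paid on abort.
    return "abort" if any(conflicts(txn, o) for o in overlapping) else "commit"
```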

In summary, our H-Store concurrency control algorithm is:

•  Run sterile, single-sited and one-shot transactions with no controls

•  Other transactions are run with the basic strategy

•  If there are too many aborts, escalate to the intermediate strategy

•  If there are still too many aborts, further escalate to the advanced strategy.

It should be noted that this strategy is a sophisticated optimistic concurrency control scheme. Optimistic methods have been extensively investigated previously [KR81, ACL87]. Moreover, the Ants DBMS [Ants07] leverages commutativity to lower locking costs. Hence, this section should be considered as a very low overhead consolidation of known techniques.

Notice that we have not yet employed any sophisticated scheduling techniques to lower conflict. For example, it is possible to run examples from all pairs of transaction classes and record the conflict frequency. Then, a scheduler could take this information into account, and try to avoid running transactions together with a high probability of conflict.

Figure 1  TPC-C Schema (reproduced from the TPC-C specification version 5.8.0, page 10)

The next section shows how these techniques and the rest of the H-Store design works on TPC-C.

5  A Performance Comparison

TPC-C runs on the schema diagrammed in Figure 1, and contains 5 transaction classes (new_order, payment, order_status, delivery, and stock_level).

Because of space limitations, we will not include the code for these transactions; the interested reader is referred to the TPC-C specification [TPCC]. Table 1 summarizes their behavior.

There are three possible strategies for an efficient H-Store implementation of TPC-C. First, we could run on a single core, single CPU machine. This automatically makes every transaction class single-sited, and each transaction can be run to completion in a single-threaded environment. The paired HA site will achieve the same execution order, since, as will be seen momentarily, all transaction classes can be made strongly two-phase, meaning that all transactions will either succeed at both sites or abort at both sites. Hence, on a single site with a paired HA site, ACID properties are achieved with no overhead whatsoever. The other two strategies are for parallel operation on multi-core and/or multi-CPU systems. They involve making the workload either sterile or one-shot, which, as we discussed in the previous section, is sufficient to allow us to run queries without conventional concurrency control. To do this, we will need to perform some trickery with the TPC-C workload; before describing this, we first address data partitioning.

Table 1 TPC-C Transaction Classes

new_order

Place an order for a customer. 90% of all orders can be supplied in full by stocks from the customer’s “home” warehouse; 10% need to access stock belonging to a remote warehouse. Read/write transaction. No minimum percentage of mix required, but about 50% of transactions are new_order transactions.

Payment

Updates the customer’s balance and warehouse/district sales fields. 85% of updates go to customer’s home warehouse; 15% to a remote warehouse. Read/write transaction. Must be at least 43% of transaction mix.

order_status

Queries the status of a customer’s last order. Read only. Must be at least 4% of transaction mix.

Delivery

Select a warehouse, and for each of 10 districts “deliver” an order, which means removing a record from the new-order table and updating the customer’s account balance. Each delivery can be a separate transaction; must be at least 4% of transaction mix.

stock_level

Finds items with a stock level below a threshold; read only, must read committed data but does not need serializability. Must be at least 4% of transaction mix.

TPC-C is not a tree-structured schema. The presence of the Item table as well as the relationship of Order-line with Stock make it a non-tree schema. The Item table, however, is read-only and can be replicated at each site. The Order-line table can be partitioned according to Warehouse to each site. With such replication and partitioning, the schema is decomposed such that each site has a subset of the records rooted at a distinct partition of the warehouses. This will be termed the basic H-Store strategy for partitioning and replication.
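The basic strategy can be sketched concretely. The snippet below is an illustrative model, not H-Store code: the table names follow TPC-C, while `NUM_SITES`, `sites_for`, and the hash-by-warehouse routing are assumptions made for this example.

```python
# Illustrative model of the basic H-Store strategy (not H-Store code):
# the read-only Item table is replicated at every site, and all
# warehouse-rooted tables are partitioned by warehouse id.

NUM_SITES = 4

REPLICATED = {"item"}  # read-only, so a full copy lives at every site
PARTITIONED = {"warehouse", "district", "customer", "history",
               "stock", "orders", "new_order", "order_line"}

def sites_for(table, warehouse_id):
    """Sites that hold the rows of `table` for the given warehouse."""
    if table in REPLICATED:
        return list(range(NUM_SITES))
    if table in PARTITIONED:
        return [warehouse_id % NUM_SITES]
    raise KeyError(table)
```

Because every site holds Item in full, a transaction that touches only its home warehouse's partition plus Item never needs to leave its site.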

5.1  Query Classes

All transaction classes except new_order are already two-phase since they never need to abort. New_order may need to abort, since it is possible that its input contains invalid item numbers. However, it is permissible in the TPC-C specification to run a query for each item number at the beginning of the transaction to check for valid item numbers. By rearranging the transaction logic, all transaction classes become two-phase. It is also true that all transaction classes are strongly two-phase. This is because the Item table is never updated, and therefore all new_order transactions sent to all replicas always reach the same decision of whether to abort or not.
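The rearrangement described above can be illustrated with a small sketch (hypothetical names; `VALID_ITEMS` stands in for a read-only lookup against the replicated Item table): all item numbers are checked in a read-only first phase, so the abort decision precedes any update.

```python
# Sketch of the rearranged new_order logic. Phase 1 performs all
# validation reads, so the only possible abort happens before any
# update; phase 2 is pure writes and always commits.

VALID_ITEMS = {1001, 1002, 1003}  # stand-in for the Item table

def new_order(item_ids):
    applied = []
    # Phase 1: read-only validation of every item number up front.
    if any(i not in VALID_ITEMS for i in item_ids):
        return "abort", applied          # no updates were made
    # Phase 2: updates only; the commit decision is already fixed.
    for i in item_ids:
        applied.append(("update_stock", i))
    return "commit", applied
```

Since Item is never updated, every replica reaches the same phase-1 decision on the same input, which is what makes the class strongly two-phase.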

All 5 transaction classes appear to be sterile when considered with the basic partitioning and replication strategy. We make three observations in this regard.

First, the new_order transaction inserts a tuple in both the Orders table and New_Orders table as well as line items in the Line_order table. At each site, these operations will be part of a single sub-plan, and there will be no interleaved operations. This will ensure that the order_status transaction does not see partially completed new orders. Second, because new_order and payment transactions in TPC-C are strongly two-phase, no additional coordination is needed between sites in the event that one of these transactions updates a “remote” warehouse relative to the customer making the order or payment.

Third, the stock_level transaction is allowed to run as multiple transactions which can see stock levels for different items at different points in time, as long as the stock level results from committed transactions. Because new_orders are aborted, if necessary, before they perform any updates, any stock information read comes from committed transactions (or transactions that will be committed soon).

Hence, all transaction classes can be made sterile and strongly two-phase. As such, they achieve a valid execution of TPC-C with no concurrency control. Although we could have tested this configuration, we decided to employ additional manipulation of the workload to also make all transaction classes one-shot, since doing so improves performance.

With the basic strategy, all transaction classes except new_order and payment are single-sited, and therefore one-shot. Payment is already one-shot, since there is no need to exchange data when updating a remote warehouse. New_order, however, needs to insert into Order-line information about the district of a stock entry which may reside at a remote site. Since that field is never updated, and there are no deletes/inserts into the Stock table, we can vertically partition Stock and replicate the read-only parts of it across all sites. With this replication trick added to the basic strategy, new_order becomes one-shot.
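The vertical-partitioning trick can be sketched as follows, with column names loosely following the TPC-C Stock schema (which columns are read-only is an assumption made for illustration): the never-updated columns are split off and replicated at every site, while the updatable remainder stays partitioned with its warehouse.

```python
# Sketch of vertically partitioning Stock into a replicated read-only
# part and a warehouse-partitioned updatable part (illustrative names).

STOCK_COLUMNS = {"s_i_id", "s_w_id", "s_quantity", "s_dist_01", "s_data"}
READ_ONLY = {"s_dist_01", "s_data"}          # assumed never updated

replicated_part = READ_ONLY                  # copied to all sites
partitioned_part = STOCK_COLUMNS - READ_ONLY # lives with its warehouse
```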

As a result, with the basic strategy augmented with the tricks described above, all transaction classes become one-shot and strongly two-phase. As long as we add a short delay as mentioned in Section 4.4, ACID properties are achieved with no concurrency control overhead whatsoever. This is the configuration on which benchmark results are reported in Section 5.3.

It is difficult to imagine that an automatic program could figure out what is required to make TPC-C either one-shot or sterile. Hence, a knowledgeable human would have to carefully code the transactions classes. It is likely, however, that most transaction classes will be simpler to analyze. As such, it is an open question how successful automatic transaction class analysis will be.

5.2  Implementation

We implemented a variant of TPC-C on H-Store and on a very popular commercial RDBMS. The same driver was used for both systems and generated transactions at the maximum rate without modeling think time. These transactions were delivered to both systems using TCP/IP. All transaction classes were implemented as stored procedures. In H-Store the transaction logic was coded in C++, with local procedure calls to H-Store query execution. In contrast, the transaction logic for the commercial system was written using their proprietary stored procedure language. High availability and communication with user terminals was not included for either system.

Both DBMSs were run on a dual-core 2.8GHz CPU computer system, with 4 Gbytes of main memory and four 250 GB SATA disk drives. Both DBMSs used horizontal partitioning to advantage.

5.3  Results

On this configuration, H-Store ran 70,416 TPC-C transactions per second. In contrast, we could only coax 850 transactions per second from the commercial system, in spite of several days of tuning by a professional DBA, who specializes in this vendor’s product. Hence, H-Store ran a factor of 82 faster (almost two orders of magnitude).

Per our earlier discussion, the bottleneck for the commercial system was logging overhead. That system spent about 2/3 of its total elapsed time inside the logging system. One of us spent many hours trying to tune the logging system (log to a dedicated disk, change the size of the group commit; all to no avail). If logging was turned off completely, and assuming no other bottleneck creeps up, then throughput would increase to about 2,500 transactions per second.
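The ~2,500 transactions-per-second estimate follows from simple arithmetic: if roughly 2/3 of elapsed time is logging, removing it leaves about 1/3 of the per-transaction cost, so throughput scales by roughly 3x.

```python
# Back-of-the-envelope check of the ~2,500 tps estimate quoted above.
measured_tps = 850
logging_fraction = 2 / 3
estimated_tps = measured_tps / (1 - logging_fraction)  # ~2,550 tps
```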

The next bottleneck appears to be the concurrency control system. In future experiments, we plan to tease apart the overhead contributions which result from:

•  Redo logging

•  Undo logging

• Latching

•  Locking

Finally, though we did not implement all of the TPC-C specification (we did not, for example, model wait times), it is also instructive to compare our partial TPC-C implementation with TPC-C performance records on the TPC website.2 The highest performing TPC-C implementation executes about 4 million new-order transactions per minute, or about 133,000 total transactions per second. This is on a 128-core shared-memory machine, so this implementation is getting about 1,000 transactions per core. Contrast this with 400 transactions per core in our benchmark on a commercial system on a (rather pokey) desktop machine, or 35,000 transactions per core in H-Store! Also, note that H-Store is within a factor of two of the best TPC-C results on a machine costing around $1,000.
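The per-core figures quoted above can be recomputed directly from the raw numbers (133,000 tps on 128 cores for the top TPC-C result; 850 and 70,416 tps on the dual-core test machine):

```python
# Per-core throughput, recomputed from the raw figures in the text.
best_tpcc_per_core = 133_000 / 128   # top TPC-C result, 128 cores
commercial_per_core = 850 / 2        # commercial system, dual-core machine
hstore_per_core = 70_416 / 2         # H-Store, same dual-core machine
```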

In summary, the conclusion to be reached is that nearly two orders of magnitude in performance improvement are available to a system designed along the lines of H-Store.

6  Some Comments about a “One Size Does Not Fit All” World

If the results of this paper are to be believed, then we are heading toward a world with at least 5 (and probably more) specialized engines and the death of the “one size fits all” legacy systems. This section considers some of the consequences of such an architectural shift.

6.1  The Relational Model Is not Necessarily the Answer

Having survived the great debate of 1974 [Rus74] and the surrounding arguments between the advocates of the Codasyl and relational models, we are reluctant to bring up this particular “sacred cow”. However, it seems appropriate to consider the data model (or data models) that we build systems around. In the 1970’s the DBMS world contained only business data processing applications, and Ted Codd’s idea of normalizing data into flat tables has served our community well over the subsequent 30 years. However, there are now other markets, whose needs must be considered. These include data warehouses, web-oriented search, real-time analytics, and semi-structured data markets.

We offer the following observations.

1. In the data warehouse market, nearly 100% of all schemas are stars or snowflakes, containing a central fact table with 1-n joins to surrounding dimension tables, which may in turn participate in further 1-n joins to second level dimension tables, and so forth. Although stars and snowflakes are easily modeled using relational schemas, in fact, an entity-relationship model would be simpler in this environment and more natural. Moreover, warehouse queries would be simpler in an E-R model. Lastly, warehouse operations that are incredibly expensive with a relational implementation, for example changing the key of a row in a dimension table, might be made faster with some sort of E-R implementation.

2.  In the stream processing market, there is a need to:

(a)  Process streams of messages at high speed

(b)  Correlate such streams with stored data

To accomplish both tasks, there is widespread enthusiasm for Stream-SQL, a generalization of SQL that allows a programmer to mix stored tables and streams in the FROM clause of a SQL statement. This work has evolved from the pioneering work of the Stanford Stream group [ABW06] and is being actively discussed for standardization. Of course, StreamSQL supports relational schemas for both tables and streams.

However, commercial feeds, such as Reuters, Infodyne, etc., have all chosen some data model for their messages to obey. Some are flat and fit nicely into a relational schema. However, several are hierarchical, such as the FX feed for foreign exchange. Stream processing systems, such as StreamBase and Coral8, currently support only flat (relational) messages. In such systems, a front-end adaptor must normalize hierarchical objects into several flat message types for processing. Unfortunately, it is rather painful to join the constituent pieces of a source message back together when processing on multiple parts of a hierarchy is necessary.

To solve this problem, we expect the stream processing vendors to move aggressively to hierarchical data models. Hence, they will assuredly deviate from Ted Codd’s principles.

3.  Text processing obviously has never used a relational model.

4.  Any scientific-oriented DBMS, such as ASAP [SBC+07], will probably implement arrays, not tables as their basic data type.

5.  There has recently been considerable debate over good data models for semi-structured data. There is certainly fierce debate over the excessive complexity of XMLSchema [SC05]. There are fans of using RDF for such data [MM04], and some who argue that RDF can be efficiently implemented by a relational column store [AMM+07]. Suffice it to say that there are many ideas on which way to go in this area.

In summary, the relational model was developed for a “one size fits all” world. The various specialized systems which we envision can each rethink what data model would work best for their particular needs.

6.2  SQL is Not the Answer

SQL is a “one size fits all” language. In an OLTP world one never asks for the employees who earn more than their managers. In fact, there are no ad-hoc queries, as noted earlier. Hence, one can implement a smaller language than SQL. For performance reasons, stored procedures are omni-present. In a data warehouse world, one needs a different subset of SQL, since there are complex ad-hoc queries, but no stored procedures. Hence, the various storage engines can implement vertical-market specific languages, which will be simpler than the daunting complexity of SQL.

Rethinking how many query languages should exist as well as their complexity will have a huge side benefit. At this point SQL is a legacy language with many known serious flaws, as noted by Chris Date two decades ago [Dat84]. Next time around, we can do a better job.

When rethinking data access languages, we are reminded of a raging discussion from the 1970’s. On the one-hand, there were advocates of a data sublanguage, which could be interfaced to any programming language. This has led to high overhead interfaces, such as JDBC and ODBC. In addition, these interfaces are very difficult to use from a conventional programming language.

In contrast, some members of the DBMS community proposed much nicer embedding of database capabilities in programming languages, typified in the 1970s by Pascal R [Sch80] and Rigel [RS79]. Both had clean integration with programming language facilities, such as control flow, local variables, etc. Chris Date also proposed an extension to PL/1 with the same purpose [Dat76].

Obviously none of these languages ever caught on, and the data sublanguage camp prevailed. The couplings between a programming language and a data sublanguage that our community has designed are ugly beyond belief and are low productivity systems that date from a different era. Hence, we advocate scrapping sublanguages completely, in favor of much cleaner language embeddings.

In the programming language community, there has been an explosion of “little languages” such as Python, Perl, Ruby and PHP. The idea is that one should use the best language available for any particular task at hand. Also little languages are attractive because they are easier to learn than general purpose languages. From afar, this phenomenon appears to be the death of “one size fits all” in the programming language world.

Little languages have two very desirable properties. First, they are mostly open source, and can be altered by the community. Second they are less daunting to modify than the current general purpose languages. As such, we are advocates of modifying little languages to include clean embeddings of DBMS access.

Our current favorite example of this approach is Ruby-on-Rails.3 This system is the little language, Ruby, extended with integrated support for database access and manipulation through the “model-view-controller” programming pattern. Ruby-on-Rails compiles into standard JDBC, but hides all the complexity of that interface.

Hence, H-Store plans to move from C++ to Ruby-on-Rails as our stored procedure language. Of course, the language run-time must be linked into the DBMS address space, and must be altered to make calls to DBMS services using high performance local procedure calls, not JDBC.

7  Summary and Future Work

In the last quarter of a century, there has been a dramatic shift in:

1.  DBMS markets: from business data processing to a collection of markets, with varying requirements

2.  Necessary features: new requirements include shared nothing support and high availability

3.  Technology: large main memories, the possibility of hot standbys, and the web change most everything

The result is:

1.  The predicted demise of “one size fits all”

2.  The inappropriateness of current relational implementations for any segment of the market

3.  The necessity of rethinking both data models and query languages for the specialized engines, which we expect to be dominant in the various vertical markets

Our H-Store prototype demonstrates the performance gains that can be had when this conventional thinking is questioned. Of course, beyond these encouraging initial performance results, there are a number of areas where future work is needed. In particular:

•  More work is needed to identify when it is possible to automatically identify single-sited, two-phase, and one-shot applications. “Auto-everything” tools that can suggest partitions that lead to these properties are also essential.

•  The rise of multi-core machines suggests that there may be interesting optimizations related to sharing of work between logical sites physically co-located on the same machine.

•  A careful study of the performance of the various transaction management strategies outlined in Section 3 is needed.

•  A study of the overheads of the various components of an OLTP system—logging, transaction processing and two-phase commit, locking, JDBC/ODBC, etc.—would help identify which aspects of traditional DBMS design contribute most to the overheads we have observed.

•  After stripping out all of these overheads, our H-Store implementation is now limited by the performance of in-memory data structures, suggesting that optimizing these structures will be important. For example, we found that the simple optimization of representing read-only tables as arrays offered significant gains in transaction throughput in our H-Store implementation.

•  Integration with data warehousing tools—for example, by using no-overwrite storage and occasionally dumping records into a warehouse—will be essential if H-Store-like systems are to seamlessly co-exist with data warehouses.

In short, the current situation in the DBMS community reminds us of the period 1970-1985 where there was a “group grope” for the best way to build DBMS engines and dramatic changes in commercial products and DBMS vendors ensued. The 1970-1985 period was a time of intense debate, a myriad of ideas, and considerable upheaval.

We predict the next fifteen years will have the same feel.

References

[ABW06] A. Arasu, S. Babu, and J. Widom. “The CQL Continuous Query Language: Semantic Foundations and Query Execution.” The VLDB Journal, 15(2), June 2006.

[ACL87] R. Agrawal, M.J. Carey, and M. Livny. “Concurrency control performance modeling: alternatives and implications.” ACM Trans. Database Syst. 12(4), Nov. 1987.

[AMM+07] D. Abadi, A. Marcus, S. Madden, and K. Hollenbach. “Scalable Semantic Web Data Management Using Vertical Partitioning.” In Proc. VLDB, 2007.

[Ants07] ANTs Software. ANTs Data Server-Technical White Paper, http://www.ants.com, 2007.

[BSR80] P. A. Bernstein, D. Shipman, and J. B. Rothnie. “Concurrency Control in a System for Distributed Databases (SDD-1).” ACM Trans. Database Syst. 5(1), March 1980.

[Bon02] P. A. Boncz. “Monet: A Next-Generation DBMS Kernel For Query-Intensive Applications.” Ph.D. Thesis, Universiteit van Amsterdam, Amsterdam, The Netherlands, May 2002.

[Dat76] C. J. Date. “An Architecture for High-Level Language Database Extensions.” In Proc. SIGMOD, 1976.

[Dat84] C. J. Date. “A critique of the SQL database language.” In SIGMOD Record 14(3):8-54, Nov. 1984.

[DGS+90] D. J. Dewitt, S. Ghandeharizadeh, D. A. Schneider, A. Bricker, H. Hsiao, and R. Rasmussen. “The Gamma Database Machine Project.” IEEE Transactions on Knowledge and Data Engineering 2(1):44-62, March 1990.

[Hel07] P. Helland. “Life beyond Distributed Transactions: an Apostate’s Opinion.” In Proc. CIDR, 2007.

[HM93] M. Herlihy and J. E. Moss. “Transactional memory: architectural support for lock-free data structures.” In Proc. ISCA, 1993.

[KL81] H. T. Kung and J. T. Robinson. “On optimistic methods for concurrency control.” ACM Trans. Database Syst. 6(2):213–226, June 1981.

[LM06] E. Lau and S. Madden. “An Integrated Approach to Recovery and High Availability in an Updatable, Distributed Data Warehouse.” In Proc. VLDB, 2006.

[MHL+92] C. Mohan, D. Haderle, B. Lindsay, H. Pirahesh, and P. Schwarz. “ARIES: a transaction recovery method supporting fine-granularity locking and partial rollbacks using write-ahead logging.” ACM Trans. Database Syst. 17(1):94-162, March 1992.

[Mil89] D. L. Mills. “On the Accuracy and Stability of Clocks Synchronized by the Network Time Protocol in the Internet System.” SIGCOMM Comput. Commun. Rev. 20(1):65-75, Dec. 1989.

[MM04] F. Manola and E. Miller, (eds). RDF Primer. W3C Specification, February 10, 2004. http://www.w3.org/TR/REC-rdf-primer-20040210/

[RR99] J. Rao and K. A. Ross. “Cache Conscious Indexing for Decision-Support in Main Memory.” In Proc. VLDB, 1999.

[RR00] J. Rao and K. A. Ross. “Making B+-trees cache conscious in main memory.” In SIGMOD Record, 29(2):475-486, June 2000.

[RS79] L. A. Rowe and K. A. Shoens. “Data Abstractions, Views and Updates in RIGEL.” In Proc. SIGMOD, 1979.

[Rus74] Randall Rustin (Ed.): Proceedings of 1974 ACM-SIGMOD Workshop on Data Description, Access and Control, Ann Arbor, Michigan, May 1-3, 1974, 2 Volumes.

[SAB+05] M. Stonebraker, D. Abadi, A. Batkin, X. Chen, M. Cherniack, M. Ferreira, E. Lau, A. Lin, S. Madden, E. O’Neil, P. O’Neil, A. Rasin, N. Tran, and S. Zdonik. “C-Store: A Column-oriented DBMS.” In Proc. VLDB, 2005.

[SBC+07] M. Stonebraker, C. Bear, U. Cetintemel, M. Cherniack, T. Ge, N. Hachem, S. Harizopoulos, J. Lifter, J. Rogers, and S. Zdonik. “One Size Fits All?-Part 2: Benchmarking Results.” In Proc. CIDR, 2007.

[SC05] M. Stonebraker and U. Cetintemel. “One Size Fits All: An Idea whose Time has Come and Gone.” In Proc. ICDE, 2005.

[Sch80] J. W. Schmidt, et al. “Pascal/R Report.” U Hamburg, Fachbereich Informatik, Report 66, Jan 1980.

[TPCC] The Transaction Processing Council. TPC-C Benchmark (Revision 5.8.0), 2006. http://www.tpc.org/tpcc/spec/tpcc_current.pdf

Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the VLDB copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Very Large Database Endowment. To copy otherwise, or to republish, to post on servers or to redistribute to lists, requires a fee and/or special permissions from the publisher, ACM.

VLDB ’07, September 23–28, 2007, Vienna, Austria.

Copyright 2007 VLDB Endowment, ACM 978-1-59593-649-3/07/09.

Originally published in Proceedings of the 33rd International Conference on Very Large Data Bases, pp. 1150–1160, 2007.

1. http://www.vertica.com

2. http://www.tpc.org/tpcc/results/tpcc_perf_results.asp

3. http://www.rubyonrails.org

C-Store: A Column-Oriented DBMS

Mike Stonebraker (MIT CSAIL), Daniel J. Abadi (MIT CSAIL), Adam Batkin (Brandeis University), Xuedong Chen (UMass Boston), Mitch Cherniack (Brandeis University), Miguel Ferreira (MIT CSAIL), Edmond Lau (MIT CSAIL), Amerson Lin (MIT CSAIL), Sam Madden (MIT CSAIL), Elizabeth O’Neil (UMass Boston), Pat O’Neil (UMass Boston), Alex Rasin (Brown University), Nga Tran (Brandeis University), Stan Zdonik (Brown University)

Abstract

This paper presents the design of a read-optimized relational DBMS that contrasts sharply with most current systems, which are write-optimized. Among the many differences in its design are: storage of data by column rather than by row, careful coding and packing of objects into storage including main memory during query processing, storing an overlapping collection of column-oriented projections, rather than the current fare of tables and indexes, a non-traditional implementation of transactions which includes high availability and snapshot isolation for read-only transactions, and the extensive use of bitmap indexes to complement B-tree structures.

We present preliminary performance data on a subset of TPC-H and show that the system we are building, C-Store, is substantially faster than popular commercial products. Hence, the architecture looks very encouraging.

1  Introduction

Most major DBMS vendors implement record-oriented storage systems, where the attributes of a record (or tuple) are placed contiguously in storage. With this row store architecture, a single disk write suffices to push all of the fields of a single record out to disk. Hence, high performance writes are achieved, and we call a DBMS with a row store architecture a write-optimized system. These are especially effective on OLTP-style applications.

相反,面向大量数据即席查询的系统应该进行读取优化。数据仓库代表一类读取优化的系统:定期执行大批量的新数据加载,随后是相对较长时间的即席查询。其他以读取为主的应用包括客户关系管理(CRM)系统、电子图书卡片目录和其他即席查询系统。在这种环境中,连续存储每个单列(或属性)的值的列存储架构应该更高效。Sybase IQ [FREN95, SYBA04]、Addamark [ADDA04] 和 KDB [KDB04] 等产品已在仓库市场中证明了这种效率。在本文中,我们讨论称为 C-Store 的列存储的设计,它包含许多相对于现有系统的新颖特性。

In contrast, systems oriented toward ad-hoc querying of large amounts of data should be read-optimized. Data warehouses represent one class of read-optimized system, in which periodically a bulk load of new data is performed, followed by a relatively long period of ad-hoc queries. Other read-mostly applications include customer relationship management (CRM) systems, electronic library card catalogs, and other ad-hoc inquiry systems. In such environments, a column store architecture, in which the values for each single column (or attribute) are stored contiguously, should be more efficient. This efficiency has been demonstrated in the warehouse marketplace by products like Sybase IQ [FREN95, SYBA04], Addamark [ADDA04], and KDB [KDB04]. In this paper, we discuss the design of a column store called C-Store that includes a number of novel features relative to existing systems.

通过列存储架构,DBMS 只需读取处理给定查询所需的列的值,并且可以避免将不相关的属性带入内存。在典型查询涉及对大量数据项执行聚合的仓库环境中,列存储具有相当大的性能优势。然而,读取优化的架构和写入优化的架构之间还存在其他几个主要区别。

With a column store architecture, a DBMS need only read the values of columns required for processing a given query, and can avoid bringing into memory irrelevant attributes. In warehouse environments where typical queries involve aggregates performed over large numbers of data items, a column store has a sizeable performance advantage. However, there are several other major distinctions that can be drawn between an architecture that is read-optimized and one that is write-optimized.

当前的关系 DBMS 旨在将属性填充到字节或字边界,并以其本机数据格式存储值。人们认为将数据值转移到主存储器中的字节或字边界上进行处理的成本太高。然而,CPU 的速度增长速度远远快于磁盘带宽的增长速度。因此,用充足的 CPU 周期来换取并不充足的磁盘带宽是有意义的。在以读取为主的环境中,这种权衡似乎特别有利可图。

Current relational DBMSs were designed to pad attributes to byte or word boundaries and to store values in their native data format. It was thought that it was too expensive to shift data values onto byte or word boundaries in main memory for processing. However, CPUs are getting faster at a much greater rate than disk bandwidth is increasing. Hence, it makes sense to trade CPU cycles, which are abundant, for disk bandwidth, which is not. This tradeoff appears especially profitable in a read-mostly environment.

列存储可以通过两种方式用 CPU 周期换取磁盘带宽。首先,它可以将数据元素编码成更紧凑的形式。例如,如果存储的属性是客户居住的州,美国各州可以编码为 6 位,而两字符缩写需要 16 位,州名的可变长度字符串则需要更多位。其次,应该在存储中紧密打包值。例如,在列存储中,可以直接将 N 个值(每个 K 位长)打包为 N*K 位。[FREN95] 已指出列存储相对于行存储的编码和可压缩性优势。当然,还希望 DBMS 查询执行器尽可能直接对压缩表示进行操作,以避免解压缩的成本,至少在需要将值呈现给应用程序之前如此。

There are two ways a column store can use CPU cycles to save disk bandwidth. First, it can code data elements into a more compact form. For example, if one is storing an attribute that is a customer’s state of residence, then US states can be coded into six bits, whereas the two-character abbreviation requires 16 bits and a variable length character string for the name of the state requires many more. Second, one should densepack values in storage. For example, in a column store it is straightforward to pack N values, each K bits long, into N*K bits. The coding and compressibility advantages of a column store over a row store have been previously pointed out in [FREN95]. Of course, it is also desirable to have the DBMS query executor operate on the compressed representation whenever possible to avoid the cost of decompression, at least until values need to be presented to an application.
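As a rough illustration of these two ideas, the sketch below (ours, not C-Store code; the state data and function names are illustrative) dictionary-codes a column of two-character US state abbreviations into 6-bit codes and then densepacks the N resulting K-bit values into N*K bits:

```python
# Hypothetical sketch: dictionary-code a column of US state abbreviations
# into 6-bit codes (50 states fit in 6 bits), then densepack the codes.

def dict_encode(column):
    """Map each distinct value to a small integer code."""
    codes = {v: i for i, v in enumerate(sorted(set(column)))}
    return codes, [codes[v] for v in column]

def densepack(values, k):
    """Pack N values of k bits each into ceil(N*k/8) bytes, MSB first."""
    acc, bits, out = 0, 0, bytearray()
    for v in values:
        acc = (acc << k) | v
        bits += k
        while bits >= 8:
            bits -= 8
            out.append((acc >> bits) & 0xFF)
    if bits:                       # flush a final partial byte
        out.append((acc << (8 - bits)) & 0xFF)
    return bytes(out)

states = ["MA", "CA", "MA", "NY", "CA", "MA"]
codes, encoded = dict_encode(states)
packed = densepack(encoded, 6)     # 6 values * 6 bits = 36 bits -> 5 bytes
print(len(packed))                 # 5, vs. 12 bytes for the raw 2-char strings
```

The same column stored as raw 16-bit abbreviations would need 12 bytes; the packed form needs 5.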

商业关系 DBMS 存储表格数据的完整元组,以及表中属性上的辅助 B 树索引。此类索引可以是主索引(表中的行以尽可能接近按指定属性排序的顺序存储),也可以是辅助索引(不尝试按被索引属性维持底层记录的顺序)。此类索引在 OLTP 写入优化环境中有效,但在读取优化环境中表现不佳。在后一种情况下,其他数据结构更有优势,包括位图索引 [ONEI97]、交叉表索引 [ORAC04] 和物化视图 [CERI91]。在读取优化的 DBMS 中,人们可以探索仅使用这些读取优化结构来存储数据,而完全不支持写入优化结构。

Commercial relational DBMSs store complete tuples of tabular data along with auxiliary B-tree indexes on attributes in the table. Such indexes can be primary, whereby the rows of the table are stored in as close to sorted order on the specified attribute as possible, or secondary, in which case no attempt is made to keep the underlying records in order on the indexed attribute. Such indexes are effective in an OLTP write-optimized environment but do not perform well in a read-optimized world. In the latter case, other data structures are advantageous, including bit map indexes [ONEI97], cross table indexes [ORAC04], and materialized views [CERI91]. In a read-optimized DBMS one can explore storing data using only these read-optimized structures, and not support write-optimized ones at all.

因此,C-Store 物理上存储列的集合,每个列都根据某些属性进行排序。按同一属性排序的列组称为“投影”;同一列可能存在于多个投影中,并且可能根据每个投影中的不同属性进行排序。我们期望我们积极的压缩技术将使我们能够支持许多列排序顺序,而不会导致空间爆炸。多种排序顺序的存在为优化提供了机会。

Hence, C-Store physically stores a collection of columns, each sorted on some attribute(s). Groups of columns sorted on the same attribute are referred to as “projections”; the same column may exist in multiple projections, possibly sorted on a different attribute in each. We expect that our aggressive compression techniques will allow us to support many column sort-orders without an explosion in space. The existence of multiple sort-orders opens opportunities for optimization.

显然,现成的“刀片”或“网格”计算机集合将成为计算和存储密集型应用程序(例如 DBMS [ DEWI92 ])最便宜的硬件架构。因此,任何新的 DBMS 架构都应该假设一个网格环境,其中有 G 个节点(计算机),每个节点都有专用磁盘和专用内存。我们建议在“无共享”架构中跨各个节点的磁盘水平分区数据[ STON86]。在不久的将来,网格计算机可能有数十到数百个节点,任何新系统都应该针对这种规模的网格进行架构设计。当然,网格计算机的节点可以在物理上位于同一位置或划分为位于同一位置的节点的集群。由于数据库管理员很难优化网格环境,因此必须自动将数据结构分配给网格节点。此外,存储数据结构的水平分区促进了查询内并行性,并且我们在实现此构造时遵循 Gamma [ DEWI90 ] 的领导。

Clearly, collections of off-the-shelf “blade” or “grid” computers will be the cheapest hardware architecture for computing and storage intensive applications such as DBMSs [DEWI92]. Hence, any new DBMS architecture should assume a grid environment in which there are G nodes (computers), each with private disk and private memory. We propose to horizontally partition data across the disks of the various nodes in a “shared nothing” architecture [STON86]. Grid computers in the near future may have tens to hundreds of nodes, and any new system should be architected for grids of this size. Of course, the nodes of a grid computer may be physically co-located or divided into clusters of co-located nodes. Since database administrators are hard pressed to optimize a grid environment, it is essential to allocate data structures to grid nodes automatically. In addition, intra-query parallelism is facilitated by horizontal partitioning of stored data structures, and we follow the lead of Gamma [DEWI90] in implementing this construct.

许多仓库系统(例如沃尔玛 [ WEST00 ])维护其数据的两个副本,因为通过 DBMS 日志处理对非常大(TB)的数据集进行恢复的成本过高。由于磁盘每字节成本的下降,该选项变得越来越有吸引力。网格环境允许将此类副本存储在不同的处理节点上,从而支持 Tandem 式的高可用系统 [ TAND89]。然而,不要求以完全相同的方式存储多个副本。C-Store 允许以不同的排序顺序存储冗余对象,除了高可用性之外还提供更高的检索性能。一般来说,只要设计冗余,即使 G 个站点之一发生故障,也可以访问所有数据,存储重叠投影可以进一步提高性能。我们将能够容忍 K 次故障的系统称为K-safe。C-Store 可配置为支持一系列 K 值。

Many warehouse systems (e.g. Walmart [WEST00]) maintain two copies of their data because the cost of recovery via DBMS log processing on a very large (terabyte) data set is prohibitive. This option is rendered increasingly attractive by the declining cost per byte of disks. A grid environment allows one to store such replicas on different processing nodes, thereby supporting a Tandem-style highly-available system [TAND89]. However, there is no requirement that one store multiple copies in the exact same way. C-Store allows redundant objects to be stored in different sort orders providing higher retrieval performance in addition to high availability. In general, storing overlapping projections further improves performance, as long as redundancy is crafted so that all data can be accessed even if one of the G sites fails. We call a system that tolerates K failures K-safe. C-Store will be configurable to support a range of values of K.

即使在以读取为主的环境中,执行事务性更新显然也很重要。仓库需要执行在线更新以纠正错误。此外,人们越来越多地推动实时仓库,将数据可见性的延迟缩小到零。最终的愿望是数据仓库的在线更新。显然,在像 CRM 这样以阅读为主的世界中,人们需要执行一般的在线更新。

It is clearly essential to perform transactional updates, even in a read-mostly environment. Warehouses have a need to perform on-line updates to correct errors. As well, there is an increasing push toward real-time warehouses, where the delay to data visibility shrinks toward zero. The ultimate desire is on-line update to data warehouses. Obviously, in read-mostly worlds like CRM, one needs to perform general on-line updates.

提供更新和优化读取数据结构之间存在紧张关系。例如,在 KDB 和 Addamark 中,数据列按条目序列顺序维护。这允许在列的末尾高效地插入新数据项,无论是批处理还是事务处理。然而,代价是检索结构不太理想,因为大多数查询工作负载在数据按其他顺序排列时运行得更快。然而,以非条目顺序存储列将使插入变得非常困难且昂贵。

There is a tension between providing updates and optimizing data structures for reading. For example, in KDB and Addamark, columns of data are maintained in entry sequence order. This allows efficient insertion of new data items, either in batch or transactionally, at the end of the column. However, the cost is a less-than-optimal retrieval structure, because most query workloads will run faster with the data in some other order. However, storing columns in non-entry sequence will make insertions very difficult and expensive.

C-Store 从一个全新的角度处理这一困境。具体来说,我们在单个系统软件中同时结合了读取优化的列存储和面向更新/插入的可写存储,二者通过元组移动器连接,如图 1 所示。顶层是一个小型可写存储(WS)组件,其架构旨在支持高性能的插入和更新。另有一个大得多的组件,称为读取优化存储(RS),能够支持非常大量的信息。顾名思义,RS 针对读取进行了优化,仅支持非常有限的插入形式,即记录从 WS 到 RS 的批量移动,该任务由图 1 中的元组移动器执行。

C-Store approaches this dilemma from a fresh perspective. Specifically, we combine in a single piece of system software, both a read-optimized column store and an update/insert-oriented writeable store, connected by a tuple mover, as noted in Figure 1. At the top level, there is a small Writeable Store (WS) component, which is architected to support high performance inserts and updates. There is also a much larger component called the Read-optimized Store (RS), which is capable of supporting very large amounts of information. RS, as the name implies, is optimized for read and supports only a very restricted form of insert, namely the batch movement of records from WS to RS, a task that is performed by the tuple mover of Figure 1.


图 1  C-Store 架构

Figure 1  Architecture of C-Store

当然,查询必须访问两个存储系统中的数据。插入被发送到 WS,而删除必须在 RS 中标记,以便稍后由元组移动器清除。更新被实现为一次插入加一次删除。为了支持高速元组移动器,我们使用 LSM 树概念的一个变体 [ONEI96],它支持合并输出(merge out)过程:通过将有序的 WS 数据对象与较大的 RS 块高效合并,把元组从 WS 批量移动到 RS,并在操作完成时安装一个新的 RS 副本。

Of course, queries must access data in both storage systems. Inserts are sent to WS, while deletes must be marked in RS for later purging by the tuple mover. Updates are implemented as an insert and a delete. In order to support a high-speed tuple mover, we use a variant of the LSM-tree concept [ONEI96], which supports a merge out process that moves tuples from WS to RS in bulk by an efficient method of merging ordered WS data objects with large RS blocks, resulting in a new copy of RS that is installed when the operation completes.
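At a very high level, the merge out process amounts to a single sequential merge of two sorted runs, purging tuples marked as deleted along the way. The toy model below is our simplification (in-memory lists standing in for WS objects and RS blocks), not the LSM-tree implementation itself:

```python
import heapq

# Toy model of merge out: merge an ordered WS batch with the ordered RS
# contents, dropping records marked deleted in RS, to produce the new RS
# copy that is installed when the operation completes.

def merge_out(rs_sorted, ws_sorted, deleted):
    """Merge two sorted runs in one sequential pass, purging deletes."""
    merged = heapq.merge(rs_sorted, ws_sorted)
    return [t for t in merged if t not in deleted]

rs = [(1, "Bill"), (2, "Bob"), (4, "Sue")]   # sorted on key
ws = [(3, "Ann"), (5, "Joe")]                # sorted on key
new_rs = merge_out(rs, ws, deleted={(2, "Bob")})
print(new_rs)   # [(1, 'Bill'), (3, 'Ann'), (4, 'Sue'), (5, 'Joe')]
```

The essential property is that both inputs are already ordered, so the new RS copy is produced by sequential I/O rather than random updates in place.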

图 1的体系结构必须支持许多大型即席查询、较小的更新事务以及可能的连续插入的环境中的事务。显然,盲目支持动态锁会导致大量的读写冲突,并且由于阻塞和死锁而导致性能下降。

The architecture of Figure 1 must support transactions in an environment of many large ad-hoc queries, smaller update transactions, and perhaps continuous inserts. Obviously, blindly supporting dynamic locking will result in substantial read-write conflict and performance degradation due to blocking and deadlocks.

相反,我们期望只读查询以历史模式运行。在此模式下,查询选择一个时间戳 T,它小于最近提交的某个事务的时间戳,并且查询在语义上保证产生截至历史上该时间点的正确答案。提供这种快照隔离 [BERE95] 要求 C-Store 在插入数据元素时为其打上时间戳,并仔细编写运行时系统,使其忽略时间戳晚于 T 的元素。

Instead, we expect read-only queries to be run in historical mode. In this mode, the query selects a timestamp, T, less than the one of the most recently committed transactions, and the query is semantically guaranteed to produce the correct answer as of that point in history. Providing such snapshot isolation [BERE95] requires C-Store to timestamp data elements as they are inserted and to have careful programming of the runtime system to ignore elements with timestamps later than T.
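The visibility rule can be sketched as a simple timestamp filter. In this illustrative model (the field names `inserted`/`deleted` are ours, not C-Store's), a query running at time T sees exactly the elements inserted at or before T and not yet deleted as of T:

```python
# Hypothetical sketch of historical-mode visibility under snapshot isolation.

def visible_at(elements, t):
    """Return values visible to a read-only query running at time t."""
    return [e["value"] for e in elements
            if e["inserted"] <= t
            and (e["deleted"] is None or e["deleted"] > t)]

data = [
    {"value": "Bob",  "inserted": 1, "deleted": None},
    {"value": "Bill", "inserted": 2, "deleted": 4},   # deleted at time 4
    {"value": "Jill", "inserted": 5, "deleted": None},
]
print(visible_at(data, 3))   # ['Bob', 'Bill']  (Jill not yet inserted)
print(visible_at(data, 5))   # ['Bob', 'Jill']  (Bill already deleted)
```

Because T is chosen below the low-water mark of committed transactions, such a query never blocks on concurrent writers and needs no locks.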

最后,大多数商业优化器和执行器都是面向行的,显然是为市场上流行的行存储而构建的。由于RS和WS都是面向列的,因此构建面向列的优化器和执行器是有意义的。正如我们将看到的,该软件看起来与当今流行的传统设计完全不同。

Lastly, most commercial optimizers and executors are row-oriented, obviously built for the prevalent row stores in the marketplace. Since both RS and WS are column-oriented, it makes sense to build a column-oriented optimizer and executor. As will be seen, this software looks nothing like the traditional designs prevalent today.

在本文中,我们概述了可更新列存储 C-Store 的设计,它可以同时在仓库式查询上实现非常高的性能,并在 OLTP 式事务上实现合理的速度。C-Store 是一种面向列的 DBMS,其架构旨在减少每个查询的磁盘访问次数。C-Store 的创新特性包括:

In this paper, we sketch the design of our updatable column store, C-Store, that can simultaneously achieve very high performance on warehouse-style queries and achieve reasonable speed on OLTP-style transactions. C-Store is a column-oriented DBMS that is architected to reduce the number of disk accesses per query. The innovative features of C-Store include:

1. 混合架构,其中 WS 组件针对频繁插入和更新进行了优化,RS 组件针对查询性能进行了优化。

1. A hybrid architecture with a WS component optimized for frequent insert and update and an RS component optimized for query performance.

2. 将表的元素以不同排序顺序冗余存储在多个重叠投影中,以便可以使用最有利的投影来解决查询。

2.  Redundant storage of elements of a table in several overlapping projections in different orders, so that a query can be solved using the most advantageous projection.

3. 使用多种编码方案之一对列进行高度压缩。

3.  Heavily compressed columns using one of several coding schemes.

4. 面向列的优化器和执行器,其原语与面向行的系统不同。

4.  A column-oriented optimizer and executor, with different primitives than in a row-oriented system.

5. 通过使用足够数量的重叠投影的 K-safety 实现高可用性和改进的性能。

5.  High availability and improved performance through K-safety using a sufficient number of overlapping projections.

6. 使用快照隔离来避免2PC和查询锁定。

6.  The use of snapshot isolation to avoid 2PC and locking for queries.

应该强调的是,虽然其中许多主题与过去被孤立研究过的事物有相似之处,但正是它们在一个真实系统中的结合,才使 C-Store 有趣而独特。

It should be emphasized that while many of these topics have parallels with things that have been studied in isolation in the past, it is their combination in a real system that make C-Store interesting and unique.

本文其余部分安排如下。第 2 节介绍 C-Store 实现的数据模型。第 3 节探讨 C-Store 的 RS 部分的设计,第 4 节接着介绍 WS 组件。第 5 节考虑将 C-Store 数据结构分配给网格中的节点,随后第 6 节介绍 C-Store 的更新和事务。第 7 节讨论 C-Store 的元组移动器组件,第 8 节介绍查询优化器和执行器。第 9 节将 C-Store 的性能与流行的商业行存储和流行的商业列存储所达到的性能进行比较。在 TPC-H 风格的查询上,C-Store 比任一对比系统都快得多。然而,必须指出,性能比较尚未完全完成;我们还没有完全集成 WS 和元组移动器,它们的开销可能很大。最后,第 10 节和第 11 节讨论相关的先前工作和我们的结论。

The rest of this paper is organized as follows. In Section 2 we present the data model implemented by C-Store. We explore in Section 3 the design of the RS portion of C-Store, followed in Section 4 by the WS component. In Section 5 we consider the allocation of C-Store data structures to nodes in a grid, followed by a presentation of C-Store updates and transactions in Section 6. Section 7 treats the tuple mover component of C-Store, and Section 8 presents the query optimizer and executor. In Section 9 we present a comparison of C-Store performance to that achieved by both a popular commercial row store and a popular commercial column store. On TPC-H style queries, C-Store is significantly faster than either alternate system. However, it must be noted that the performance comparison is not fully completed; we have not fully integrated the WS and tuple mover, whose overhead may be significant. Finally, Sections 10 and 11 discuss related previous work and our conclusions.

2  数据模型

2  Data Model

C-Store 支持标准的关系逻辑数据模型:数据库由一组命名表组成,每个表都有一组命名属性(列)。与大多数关系系统一样,C-Store 表中的属性(或属性集合)可以构成唯一的主键,或者是引用另一个表中主键的外键。C-Store 查询语言假定为 SQL,具有标准 SQL 语义。C-Store 中的数据并非按此逻辑数据模型物理存储。大多数行存储直接实现物理表,然后添加各种索引来加速访问,而 C-Store 仅实现投影。具体来说,C-Store 投影锚定于给定逻辑表 T,并包含该表中的一个或多个属性。另外,投影可以包含来自其他表的任意数量的其他属性,只要存在从锚表到包含该属性的表的 n:1(即外键)关系序列。

C-Store supports the standard relational logical data model, where a database consists of a collection of named tables, each with a named collection of attributes (columns). As in most relational systems, attributes (or collections of attributes) in C-Store tables can form a unique primary key or be a foreign key that references a primary key in another table. The C-Store query language is assumed to be SQL, with standard SQL semantics. Data in C-Store is not physically stored using this logical data model. Whereas most row stores implement physical tables directly and then add various indexes to speed access, C-Store implements only projections. Specifically, a C-Store projection is anchored on a given logical table, T, and contains one or more attributes from this table. In addition, a projection can contain any number of other attributes from other tables, as long as there is a sequence of n:1 (i.e., foreign key) relationships from the anchor table to the table containing an attribute.

表 1   EMP 数据示例

Table 1  Sample EMP data


为了形成投影,我们从 T 投影出感兴趣的属性,保留任何重复行,并执行适当的基于值的外键连接序列,以从非锚表获取属性。因此,投影具有与其锚表相同的行数。当然,可以允许更复杂的投影,但我们相信这个简单的方案将满足我们的需求,同时确保高性能。我们注意到,我们对术语"投影"的使用与常见做法略有不同,因为我们不存储从中导出投影的基表。

To form a projection, we project the attributes of interest from T, retaining any duplicate rows, and perform the appropriate sequence of value-based foreign-key joins to obtain the attributes from the non-anchor table(s). Hence, a projection has the same number of rows as its anchor table. Of course, much more elaborate projections could be allowed, but we believe this simple scheme will meet our needs while ensuring high performance. We note that we use the term projection slightly differently than is common practice, as we do not store the base table(s) from which the projection is derived.

我们将表 t 上的第 i 个投影表示为 ti,后跟投影中字段的名称。来自其他表的属性前缀为它们所来自的逻辑表的名称。在本节中,我们考虑标准 EMP(name, age, salary, dept) 和 DEPT(dname, floor) 关系的示例。EMP 数据示例如表 1 所示。这些表的一组可能的投影如示例 1 所示。

We denote the ith projection over table t as ti, followed by the names of the fields in the projection. Attributes from other tables are prepended with the name of the logical table they come from. In this section, we consider an example for the standard EMP(name, age, salary, dept) and DEPT(dname, floor) relations. Sample EMP data is shown in Table 1. One possible set of projections for these tables could be as shown in Example 1.

EMP1 (name, age)

EMP1 (name, age)

EMP2 (dept, age, DEPT.floor)

EMP2 (dept, age, DEPT.floor)

EMP3 (name, salary)

EMP3 (name, salary)

DEPT1 (dname, floor)

DEPT1 (dname, floor)

示例 1:EMP 和 DEPT 的可能投影。

Example 1: Possible projections for EMP and DEPT.

投影中的元组按列存储。因此,如果投影中有 K 个属性,就有 K 个数据结构,每个数据结构存储一列,所有列都按相同的排序键排序。排序键可以是投影中的任意一列或多列。投影中的元组按这些键从左到右的顺序排序。

Tuples in a projection are stored column-wise. Hence, if there are K attributes in a projection, there will be K data structures, each storing a single column, each of which is sorted on the same sort key. The sort key can be any column or columns in the projection. Tuples in a projection are sorted on the key(s) in left to right order.

我们通过将排序键附加到由竖线分隔的投影来指示投影的排序顺序。上述预测的可能顺序是:

We indicate the sort order of a projection by appending the sort key to the projection separated by a vertical bar. A possible ordering for the above projections would be:

EMP1 (name, age| age)

EMP1 (name, age| age)

EMP2 (dept, age, DEPT.floor| DEPT.floor)

EMP2 (dept, age, DEPT.floor| DEPT.floor)

EMP3 (name, salary| salary)

EMP3 (name, salary| salary)

DEPT1 (dname, floor| floor)

DEPT1 (dname, floor| floor)

示例 2:示例 1 中带有排序顺序的投影。

Example 2: Projections in Example 1 with sort orders.

最后,每个投影被水平划分为 1 个或多个段,每个段被赋予一个段标识符 Sid,其中 Sid > 0。C-Store 仅支持按投影排序键进行基于值的分区。因此,给定投影的每个段都与该投影排序键的一个键范围相关联。此外,所有键范围的集合对键空间进行划分。

Lastly, every projection is horizontally partitioned into 1 or more segments, which are given a segment identifier, Sid, where Sid > 0. C-Store supports only value-based partitioning on the sort key of a projection. Hence, each segment of a given projection is associated with a key range of the sort key for the projection. Moreover, the set of all key ranges partitions the key space.
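Value-based partitioning on the sort key can be sketched with a simple binary search over segment boundaries; the boundary values below are illustrative, not from the paper:

```python
import bisect

# Hypothetical sketch: segment boundaries partition the sort-key space.
# Segment 1 holds keys < 30, segment 2 holds 30 <= key < 60, segment 3 the rest.
boundaries = [30, 60]

def segment_of(sort_key):
    """Return the Sid (> 0) of the segment whose key range contains sort_key."""
    return bisect.bisect_right(boundaries, sort_key) + 1

print(segment_of(25))   # 1
print(segment_of(30))   # 2
print(segment_of(75))   # 3
```

Because the ranges partition the key space, every sort-key value maps to exactly one segment.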

显然,要回答 C-Store 中的任何 SQL 查询,数据库中的每个表都必须有一组覆盖投影,以便每个表中的每一列都存储在至少一个投影中。然而,C-Store 还必须能够从存储段的集合中重建完整的表行。为此,需要连接来自不同投影的段,我们使用存储键连接索引来完成此操作。

Clearly, to answer any SQL query in C-Store, there must be a covering set of projections for every table in the database such that every column in every table is stored in at least one projection. However, C-Store must also be able to reconstruct complete rows of tables from the collection of stored segments. To do this, it will need to join segments from different projections, which we accomplish using storage keys and join indexes.

存储键。每个段将每列的每个数据值与一个存储键 SK 相关联。同一段中具有匹配存储键的不同列的值属于同一逻辑行。我们使用术语"记录"或"元组"来指代段中的一行。存储键在 RS 中编号为 1, 2, 3, …,并不物理存储,而是根据元组在列中的物理位置推断出来(参见下面的第 3 节)。存储键在 WS 中物理存在,表示为整数,大于 RS 中任何段的最大整数存储键。

Storage Keys. Each segment associates every data value of every column with a storage key, SK. Values from different columns in the same segment with matching storage keys belong to the same logical row. We refer to a row of a segment using the term record or tuple. Storage keys are numbered 1, 2, 3, … in RS and are not physically stored, but are inferred from a tuple’s physical position in the column (see Section 3 below.) Storage keys are physically present in WS and are represented as integers, larger than the largest integer storage key for any segment in RS.

连接索引。为了根据表 T 的各种投影重建表 T 中的所有记录,C-Store 使用连接索引。如果 T1 和 T2 是覆盖表 T 的两个投影,则从 T1 中的 M 个段到 T2 中的 N 个段的连接索引在逻辑上是 M 个表的集合(T1 的每个段 S 对应一个表),由以下形式的行组成:

Join Indices. To reconstruct all of the records in a table T from its various projections, C-Store uses join indexes. If T1 and T2 are two projections that cover a table T, a join index from the M segments in T1 to the N segments in T2 is logically a collection of M tables, one per segment, S, of T1 consisting of rows of the form:

(s: SID in T2, k: Storage Key in Segment s)

(s: SID in T2, k: Storage Key in Segment s)

这里,T1 某个段中给定元组对应的连接索引条目,包含 T2 中相应(连接)元组的段 ID 和存储键。由于所有连接索引都位于锚定在同一个表上的投影之间,因此这始终是一对一的映射。连接索引的另一种观点是:它取按某种顺序 O 排序的 T1,并在逻辑上将其重新排序为 T2 的顺序 O'。

Here, an entry in the join index for a given tuple in a segment of T1 contains the segment ID and storage key of the corresponding (joining) tuple in T2. Since all join indexes are between projections anchored at the same table, this is always a one-to-one mapping. An alternative view of a join index is that it takes T1, sorted in some order O, and logically resorts it into the order, O’ of T2.


图 2  从 EMP3 到 EMP1 的连接索引。

Figure 2  A join index from EMP3 to EMP1.

为了从 T1、…、Tk 的段重建 T,必须能够通过一组连接索引找到一条路径,将 T 的每个属性映射到某种排序顺序 O*。路径是连接索引的集合:它起源于由某个投影 Ti 指定的排序顺序,穿过零个或多个中间连接索引,并终止于按顺序 O* 排序的投影。例如,为了能够根据示例 2 中的投影重建 EMP 表,我们至少需要两个连接索引。如果我们选择 age 作为公共排序顺序,我们可以构建两个索引,将 EMP2 和 EMP3 映射到 EMP1 的排序。或者,我们可以创建一个将 EMP2 映射到 EMP3 的连接索引,以及一个将 EMP3 映射到 EMP1 的连接索引。图 2 显示了将 EMP3 映射到 EMP1 的连接索引的简单示例,假设每个投影都只有单个段(SID = 1)。例如,EMP3 的第一个条目 (Bob, 10K) 对应于 EMP1 的第二个条目,因此连接索引的第一个条目的存储键为 2。在实践中,我们期望将每一列存储在多个投影中,从而允许我们维护相对较少的连接索引。这是因为在存在更新的情况下,连接索引的存储和维护成本非常昂贵:对投影的每次修改都要求更新指向它或由它引出的每个连接索引。

In order to reconstruct T from the segments of T1, …, Tk it must be possible to find a path through a set of join indices that maps each attribute of T into some sort order O*. A path is a collection of join indexes originating with a sort order specified by some projection, T i, that passes through zero or more intermediate join indices and ends with a projection sorted in order O*. For example, to be able to reconstruct the EMP table from projections in Example 2, we need at least two join indices. If we choose age as a common sort order, we could build two indices that map EMP2 and EMP3 to the ordering of EMP1. Alternatively, we could create a join index that maps EMP2 to EMP3 and one that maps EMP3 to EMP1. Figure 2 shows a simple example of a join index that maps EMP3 to EMP1, assuming a single segment (SID = 1) for each projection. For example, the first entry of EMP3, (Bob, 10K), corresponds to the second entry of EMP1, and thus the first entry of the join index has storage key 2. In practice, we expect to store each column in several projections, thereby allowing us to maintain relatively few join indices. This is because join indexes are very expensive to store and maintain in the presence of updates, since each modification to a projection requires every join index that points into or out of it to be updated as well.
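Building the Figure 2 join index can be sketched as follows. The data values here are illustrative (Table 1 itself is not reproduced above), chosen to be consistent with the text's statement that (Bob, 10K), the first entry of EMP3, corresponds to the second entry of EMP1; both projections are assumed to have a single segment (SID = 1), with storage keys being ordinal positions:

```python
# Hypothetical sketch: a join index from EMP3 (sorted on salary) to
# EMP1 (sorted on age), as a list of (sid, storage_key) pairs.

emp1 = [("Jill", 24), ("Bob", 25), ("Bill", 27)]           # (name, age | age)
emp3 = [("Bob", 10000), ("Bill", 50000), ("Jill", 80000)]  # (name, salary | salary)

# Match each EMP3 tuple to its EMP1 counterpart via the shared key (name);
# storage keys are 1-based positions within the (single) segment.
pos_in_emp1 = {name: sk for sk, (name, _) in enumerate(emp1, start=1)}
join_index = [(1, pos_in_emp1[name]) for name, _ in emp3]

print(join_index)   # [(1, 2), (1, 3), (1, 1)]
```

The first entry, (1, 2), says that EMP3's first tuple joins with storage key 2 of segment 1 of EMP1, exactly the mapping described in the text.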

数据库中的投影段及其连接索引必须分配给 C-Store 系统中的各个节点。C-Store 管理员可以选择指定数据库中的表必须是 K-safe 的。在这种情况下,即使网格中丢失 K 个节点,数据库中的所有表仍然可以被重建(即,尽管有 K 个失败站点,仍必须存在一组覆盖投影和一组映射到某个公共排序顺序的连接索引)。当发生故障时,C-Store 简单地以 K-1 安全性继续运行,直到故障得到修复、节点恢复到最新状态。我们目前正在研究实现这一目标的快速算法。

The segments of the projections in a database and their connecting join indexes must be allocated to the various nodes in a C-Store system. The C-Store administrator can optionally specify that the tables in a database must be K-safe. In this case, the loss of K nodes in the grid will still allow all tables in a database to be reconstructed (i.e., despite the K failed sites, there must exist a covering set of projections and a set of join indices that map to some common sort order.) When a failure occurs, C-Store simply continues with K-1 safety until the failure is repaired and the node is brought back up to speed. We are currently working on fast algorithms to accomplish this.

因此,C-Store 物理 DBMS 设计问题是确定为数据库中逻辑表集合创建的投影、段、排序键和连接索引的集合。此物理模式必须提供 K-安全性,并在不超过给定空间预算 B 的前提下,针对由 C-Store 管理员提供的给定训练工作负载获得最佳整体性能。此外,可以指示 C-Store 保留所有查询的日志,定期用作训练工作负载。由于熟练的 DBA 供不应求,我们正在编写一个自动模式设计工具。[PAPA04] 中解决了类似的问题。

Thus, the C-Store physical DBMS design problem is to determine the collection of projections, segments, sort keys, and join indices to create for the collection of logical tables in a database. This physical schema must give K-safety as well as the best overall performance for a given training workload, provided by the C-Store administrator, subject to requiring no more than a given space budget, B. Additionally, C-Store can be instructed to keep a log of all queries to be used periodically as the training workload. Because there are not enough skilled DBAs to go around, we are writing an automatic schema design tool. Similar issues are addressed in [PAPA04].

现在我们转向 C-Store 中投影、段、存储键和连接索引的表示。

We now turn to the representation of projections, segments, storage keys, and join indexes in C-Store.

3   RS

3  RS

RS 是一种读取优化的列存储。因此,任何投影的任何段都被分解为其组成列,每列都按投影排序键的顺序存储。RS 中每个元组的存储键是该记录在段中的序号。该存储键不被物理存储,而是按需计算。

RS is a read-optimized column store. Hence any segment of any projection is broken into its constituent columns, and each column is stored in order of the sort key for the projection. The storage key for each tuple in RS is the ordinal number of the record in the segment. This storage key is not stored but calculated as needed.

3.1  编码方案

3.1  Encoding Schemes

RS 中的列使用 4 种编码之一进行压缩。为某列选择的编码取决于它的顺序(即该列是按该列自身的值排序(自序),还是按同一投影中另一列的相应值排序(外序)),以及它包含的不同值的比例。我们在下面描述这些编码。

Columns in the RS are compressed using one of 4 encodings. The encoding chosen for a column depends on its ordering (i.e., is the column ordered by values in that column (self-order) or by corresponding values of some other column in the same projection (foreign-order)), and the proportion of distinct values it contains. We describe these encodings below.

类型 1:自序,不同值很少。使用类型 1 编码的列由三元组序列 (v, f, n) 表示,其中 v 是存储在列中的值,f 是 v 在列中首次出现的位置,n 是 v 在列中出现的次数。例如,如果一串 4 出现在位置 12-18,则由条目 (4, 12, 7) 捕获。对于自排序的列,列中的每个不同值只需要一个三元组。为了支持对此类列中值的搜索查询,类型 1 编码的列在其值字段上建有聚集 B 树索引。由于 RS 没有在线更新,我们可以紧密打包索引,不留任何空白空间。此外,对于大磁盘块(例如 64-128K),该索引的高度可以保持很小(例如不超过 2)。

Type 1: Self-order, few distinct values. A column encoded using Type 1 encoding is represented by a sequence of triples, (v, f, n) such that v is a value stored in the column, f is the position in the column where v first appears, and n is the number of times v appears in the column. For example, if a group of 4’s appears in positions 12-18, this is captured by the entry, (4, 12, 7). For columns that are self-ordered, this requires one triple for each distinct value in the column. To support search queries over values in such columns, Type 1-encoded columns have clustered B-tree indexes over their value fields. Since there are no online updates to RS, we can densepack the index leaving no empty space. Further, with large disk blocks (e.g., 64-128K), the height of this index can be kept small (e.g., 2 or less).
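A minimal sketch of Type 1 encoding (ours, not C-Store code; positions are 1-based here) groups each run of equal values into one (value, first_position, run_length) triple:

```python
from itertools import groupby

# Hypothetical sketch: Type 1 encoding of a self-ordered column as
# (v, f, n) triples - value, first position, number of occurrences.

def type1_encode(column):
    triples, pos = [], 1
    for v, run in groupby(column):
        n = len(list(run))
        triples.append((v, pos, n))
        pos += n
    return triples

col = [2, 2, 2, 4, 4, 7]        # self-ordered, few distinct values
print(type1_encode(col))         # [(2, 1, 3), (4, 4, 2), (7, 6, 1)]
```

Since the column is sorted, each distinct value occupies exactly one run, so one triple per distinct value suffices, as the text notes.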

类型 2:外序,不同值很少。使用类型 2 编码的列由元组序列 (v, b) 表示,其中 v 是存储在列中的值,b 是指示该值存储位置的位图。例如,给定一列整数 0,0,1,1,2,1,0,2,1,我们可以将其类型 2 编码为三对:(0,110000100)、(1,001101001) 和 (2,000010010)。由于每个位图都是稀疏的,因此对其进行游程编码以节省空间。为了有效地查找类型 2 编码列的第 i 个值,我们包括"偏移索引":将列中的位置映射到该列中所含值的 B 树。

Type 2: Foreign-order, few distinct values. A column encoded using Type 2 encoding is represented by a sequence of tuples, (v, b) such that v is a value stored in the column and b is a bitmap indicating the positions in which the value is stored. For example, given a column of integers 0,0,1,1,2,1,0,2,1, we can Type 2-encode this as three pairs: (0,110000100), (1, 001101001), and (2,000010010). Since each bitmap is sparse, it is run length encoded to save space. To efficiently find the i-th value of a type 2-encoded column, we include “offset indexes”: B-trees that map positions in a column to the values contained in that column.
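The (value, bitmap) construction can be sketched directly from the text's example; bitmaps are shown as 0/1 strings for readability, whereas a real implementation would run-length encode each sparse bitmap:

```python
from collections import OrderedDict

# Hypothetical sketch: Type 2 encoding of a foreign-ordered column as
# (value, bitmap) pairs, reproducing the example column from the text.

def type2_encode(column):
    bitmaps = OrderedDict()
    for i, v in enumerate(column):
        bitmaps.setdefault(v, ["0"] * len(column))[i] = "1"
    return [(v, "".join(bits)) for v, bits in bitmaps.items()]

col = [0, 0, 1, 1, 2, 1, 0, 2, 1]
print(type2_encode(col))
# [(0, '110000100'), (1, '001101001'), (2, '000010010')]
```

Each position of the column is set in exactly one bitmap, so the column can be reconstructed losslessly from the pairs.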

类型 3:自序,不同值很多。此方案的想法是将列中的每个值表示为相对于前一个值的增量。例如,由值 1,4,7,7,8,12 组成的列将由序列 1,3,3,0,1,4 表示:序列中的第一个条目是列中的第一个值,其后每个条目都是相对于前一个值的增量。类型 3 编码是该压缩方案的面向块的形式:每个块的第一个条目是列中的一个值及其关联的存储键,其后每个值都是相对于前一个值的增量。这种方案让人想起 VSAM 编码 B 树索引键的方式 [VSAM04]。同样,块级的密集 B 树可用于索引这些编码对象。

Type 3: Self-order, many distinct values. The idea for this scheme is to represent every value in the column as a delta from the previous value in the column. Thus, for example, a column consisting of values 1,4,7,7,8,12 would be represented by the sequence: 1,3,3,0,1,4, such that the first entry in the sequence is the first value in the column, and every subsequent entry is a delta from the previous value. Type-3 encoding is a block-oriented form of this compression scheme, such that the first entry of every block is a value in the column and its associated storage key, and every subsequent value is a delta from the previous value. This scheme is reminiscent of the way VSAM codes B-tree index keys [VSAM04]. Again, a densepack B-tree at the block level can be used to index these coded objects.
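The delta scheme round-trips losslessly, which a short sketch (ours, using the text's example values) makes concrete:

```python
# Hypothetical sketch: Type 3 delta encoding and decoding of a
# self-ordered column with many distinct values.

def type3_encode(column):
    """First value verbatim, then each value as a delta from its predecessor."""
    return [column[0]] + [b - a for a, b in zip(column, column[1:])]

def type3_decode(deltas):
    out = [deltas[0]]
    for d in deltas[1:]:
        out.append(out[-1] + d)
    return out

col = [1, 4, 7, 7, 8, 12]
enc = type3_encode(col)
print(enc)                         # [1, 3, 3, 0, 1, 4]
assert type3_decode(enc) == col    # lossless round trip
```

Because the column is sorted, the deltas are small non-negative integers, which pack into far fewer bits than the raw values.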

类型 4:外序,不同值很多。如果存在大量不同的值,那么不对这些值进行编码可能是合理的。然而,我们仍在研究针对这种情况的可能压缩技术。密集 B 树仍可用于索引。

Type 4: Foreign-order, many distinct values. If there are a large number of values, then it probably makes sense to leave the values unencoded. However, we are still investigating possible compression techniques for this situation. A densepack B-tree can still be used for the indexing.

3.2  连接索引

3.2  Join Indexes

必须使用连接索引来连接锚定在同一个表上的各个投影。如前所述,连接索引是 (sid, storage_key) 对的集合。这两个字段中的每一个都可以存储为普通列。

Join indexes must be used to connect the various projections anchored at the same table. As noted earlier, a join index is a collection of (sid, storage_key) pairs. Each of these two fields can be stored as normal columns.

关于存储连接索引的位置存在物理数据库设计含义,我们将在下一节中解决这些问题。另外,连接索引必须集成RS和WS;因此,我们也在下一节中重新审视他们的设计。

There are physical database design implications concerning where to store join indexes, and we address these in the next section. In addition, join indexes must integrate RS and WS; hence, we revisit their design in the next section as well.

4 WS

4 WS

为了避免编写两个优化器,WS也是列存储,并实现与RS相同的物理DBMS设计。因此,WS 中存在相同的投影和连接索引。然而,存储表示方式截然不同,因为 WS 必须能够高效地进行事务更新。

In order to avoid writing two optimizers, WS is also a column store and implements the identical physical DBMS design as RS. Hence, the same projections and join indexes are present in WS. However, the storage representation is drastically different because WS must be efficiently updatable transactionally.

The storage key, SK, for each record is explicitly stored in each WS segment. A unique SK is given to each insert of a logical tuple in a table T. The execution engine must ensure that this SK is recorded in each projection that stores data for the logical tuple. This SK is an integer, larger than the number of records in the largest segment in the database.

For simplicity and scalability, WS is horizontally partitioned in the same way as RS. Hence, there is a 1:1 mapping between RS segments and WS segments. A (sid, storage_key) pair identifies a record in either of these containers.

Since we assume that WS is trivial in size relative to RS, we make no effort to compress data values; instead we represent all data directly. Therefore, each projection uses B-tree indexing to maintain a logical sort-key order.

Every column in a WS projection is represented as a collection of pairs, (v, sk), such that v is a value in the column and sk is its corresponding storage key. Each pair is represented in a conventional B-tree on the second field. The sort key(s) of each projection is additionally represented by pairs (s, sk) such that s is a sort key value and sk is the storage key describing where s first appears. Again, this structure is represented as a conventional B-tree on the sort key field(s). To perform searches using the sort key, one uses the latter B-tree to find the storage keys of interest, and then uses the former collection of B-trees to find the other fields in the record.
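The two-step search described above can be sketched as follows, with Python dicts and a sorted list standing in for the per-column B-trees and the sort-key B-tree (an illustrative model, not the BerkeleyDB structures C-Store actually uses):

```python
import bisect

class WSProjection:
    """Sketch: one {sk: value} map per column, plus a sorted
    (sort key value, sk) list standing in for the sort-key B-tree."""
    def __init__(self, sort_col):
        self.sort_col = sort_col
        self.columns = {}      # column name -> {storage key: value}
        self.sort_index = []   # sorted (sort key value, storage key) pairs

    def insert(self, sk, record):
        for col, v in record.items():
            self.columns.setdefault(col, {})[sk] = v
        bisect.insort(self.sort_index, (record[self.sort_col], sk))

    def lookup(self, key):
        # Step 1: the sort-key structure yields the storage keys of interest.
        lo = bisect.bisect_left(self.sort_index, (key,))
        hi = bisect.bisect_right(self.sort_index, (key, float("inf")))
        sks = [sk for _, sk in self.sort_index[lo:hi]]
        # Step 2: the per-column structures yield the other fields.
        return [{c: m[sk] for c, m in self.columns.items()} for sk in sks]

ws = WSProjection("shipdate")
ws.insert(100, {"shipdate": 5, "qty": 2})
ws.insert(101, {"shipdate": 3, "qty": 7})
ws.insert(102, {"shipdate": 3, "qty": 1})
```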

Join indexes can now be fully described. Every projection is represented as a collection of pairs of segments, one in WS and one in RS. For each record in the “sender,” we must store the sid and storage key of a corresponding record in the “receiver.” It will be useful to horizontally partition the join index in the same way as the “sending” projection and then to co-locate join index partitions with the sending segment they are associated with. In effect, each (sid, storage key) pair is a pointer to a record which can be in either the RS or WS.

5  Storage Management

The storage management issue is the allocation of segments to nodes in a grid system; C-Store will perform this operation automatically using a storage allocator. It seems clear that all columns in a single segment of a projection should be co-located. As noted above, join indexes should be co-located with their “sender” segments. Also, each WS segment will be co-located with the RS segments that contain the same key range.

Using these constraints, we are working on an allocator. This system will perform initial allocation, as well as reallocation when load becomes unbalanced. The details of this software are beyond the scope of this paper.

Since everything is a column, storage is simply the persistence of a collection of columns. Our analysis shows that a raw device offers little benefit relative to today’s file systems. Hence, big columns (megabytes) are stored in individual files in the underlying operating system.

6  Updates and Transactions

An insert is represented as a collection of new objects in WS, one per column per projection, plus the sort key data structure. All inserts corresponding to a single logical record have the same storage key. The storage key is allocated at the site where the update is received. To prevent C-Store nodes from needing to synchronize with each other to assign storage keys, each node maintains a locally unique counter to which it appends its local site id to generate a globally unique storage key. Keys in the WS will be consistent with RS storage keys because we set the initial value of this counter to be one larger than the largest key in RS.
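The coordination-free key scheme might be sketched as below; the bit-packing layout (counter in the high bits, site id in the low bits) is one plausible encoding chosen for illustration, not necessarily C-Store's:

```python
SITE_BITS = 16  # illustrative width reserved for the site id field

def make_storage_key(local_counter, site_id):
    """Append the site id to a locally unique counter so that keys
    from different sites can never collide."""
    return (local_counter << SITE_BITS) | site_id

def next_storage_key(state):
    """Advance a site's local counter and emit a globally unique key."""
    state["counter"] += 1
    return make_storage_key(state["counter"], state["site_id"])
```

Since each site only ever touches its own counter, no cross-site synchronization is needed at insert time.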

We are building WS on top of BerkeleyDB [SLEE04]; we use the B-tree structures in that package to support our data structures. Hence, every insert to a projection results in a collection of physical inserts on different disk pages, one per column per projection. To avoid poor performance, we plan to utilize a very large main memory buffer pool, made affordable by the plummeting cost per byte of primary storage. As such, we expect “hot” WS data structures to be largely main memory resident.

C-Store’s processing of deletes is influenced by our locking strategy. Specifically, C-Store expects large numbers of ad-hoc queries with large read sets interspersed with a smaller number of OLTP transactions covering few records. If C-Store used conventional locking, then substantial lock contention would likely be observed, leading to very poor performance.

Instead, in C-Store, we isolate read-only transactions using snapshot isolation. Snapshot isolation works by allowing read-only transactions to access the database as of some time in the recent past, before which we can guarantee that there are no uncommitted transactions. For this reason, when using snapshot isolation, we do not need to set any locks. We call the most recent time in the past at which snapshot isolation can run the high water mark (HWM) and introduce a low-overhead mechanism for keeping track of its value in our multi-site environment. If we let read-only transactions set their effective time arbitrarily, then we would have to support general time travel, an onerously expensive task. Hence, there is also a low water mark (LWM) which is the earliest effective time at which a read-only transaction can run. Update transactions continue to set read and write locks and obey strict two-phase locking, as described in Section 6.2.

6.1  Providing Snapshot Isolation

The key problem in snapshot isolation is determining which of the records in WS and RS should be visible to a read-only transaction running at effective time ET. To provide snapshot isolation, we cannot perform updates in place. Instead, an update is turned into an insert and a delete. Hence, a record is visible if it was inserted before ET and deleted after ET. To make this determination without requiring a large space budget, we use coarse granularity “epochs,” to be described in Section 6.1.1, as the unit for timestamps. Hence, we maintain an insertion vector (IV) for each projection segment in WS, which contains for each record the epoch in which the record was inserted. We program the tuple mover (described in Section 7) to ensure that no records in RS were inserted after the LWM. Hence, RS need not maintain an insertion vector. In addition, we maintain a deleted record vector (DRV) for each projection, which has one entry per projection record, containing a 0 if the tuple has not been deleted; otherwise, the entry contains the epoch in which the tuple was deleted. Since the DRV is very sparse (mostly zeros), it can be compactly coded using the type 2 algorithm described earlier. We store the DRV in the WS, since it must be updatable. The runtime system can now consult IV and DRV to make the visibility calculation for each query on a record-by-record basis.
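The visibility rule can be stated compactly. The dict-based IV/DRV representation below and the treatment of events at exactly epoch ET are simplifications for illustration:

```python
def visible(sk, effective_epoch, insertion_vector, deleted_record_vector):
    """A record is visible at effective epoch ET iff it was inserted at
    or before ET and not deleted at or before ET.
    A DRV entry of 0 means 'never deleted'; RS records carry no IV entry,
    which we model here as insertion epoch 0."""
    inserted = insertion_vector.get(sk, 0)
    deleted = deleted_record_vector.get(sk, 0)
    return inserted <= effective_epoch and (deleted == 0 or deleted > effective_epoch)
```

The runtime applies this test record by record, consulting the IV only for WS records (RS records are guaranteed to predate the LWM) and the sparse, compressed DRV for deletions.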

6.1.1  Maintaining the High Water Mark

To maintain the HWM, we designate one site the timestamp authority (TA) with the responsibility of allocating timestamps to other sites. The idea is to divide time into a number of epochs; we define the epoch number to be the number of epochs that have elapsed since the beginning of time. We anticipate epochs being relatively long – e.g., many seconds each, but the exact duration may vary from deployment to deployment. We define the initial HWM to be epoch 0 and start the current epoch at 1. Periodically, the TA decides to move the system to the next epoch; it sends an end of epoch message to each site, each of which increments the current epoch from e to e + 1, thus causing new transactions that arrive to be run with timestamp e + 1. Each site waits for all the transactions that began in epoch e (or an earlier epoch) to complete and then sends an epoch complete message to the TA. Once the TA has received epoch complete messages from all sites for epoch e, it sets the HWM to be e and sends this value to each site. Figure 3 illustrates this process.

Figure 3  Illustration showing how the HWM selection algorithm works. Gray arrows indicate messages from the TA to the sites or vice versa. We can begin reading tuples with timestamp e when all transactions from epoch e have committed. Note that although T4 is still executing when the HWM is incremented, read-only transactions will not see its updates because it is running in epoch e + 1.
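This epoch hand-off can be modeled in a few lines. The following is a toy, single-process sketch; real C-Store exchanges these steps as network messages between the TA and the sites:

```python
class TimestampAuthority:
    """Toy model of the HWM protocol: the TA ends an epoch, waits for an
    'epoch complete' report from every site, then advances the HWM."""
    def __init__(self, sites):
        self.sites = set(sites)
        self.current_epoch = 1   # epochs start at 1; the initial HWM is 0
        self.hwm = 0
        self.pending = set()     # sites still finishing the old epoch

    def end_epoch(self):
        # "End of epoch" message: every site moves new transactions to e + 1.
        self.pending = set(self.sites)
        self.current_epoch += 1

    def epoch_complete(self, site):
        # A site reports that all transactions it began in epoch e are done.
        self.pending.discard(site)
        if not self.pending:
            self.hwm = self.current_epoch - 1  # all of epoch e has committed
```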

After the TA has broadcast the new HWM with value e, read-only transactions can begin reading data from epoch e or earlier and be assured that this data has been committed. To allow users to refer to a particular real-world time when their query should start, we maintain a table mapping epoch numbers to times, and start the query as of the epoch nearest to the user-specified time.

To keep epoch numbers from growing without bound and consuming extra space, we plan to “reclaim” epochs that are no longer needed. We will do this by “wrapping” timestamps, allowing us to reuse old epoch numbers as in other protocols, e.g., TCP. In most warehouse applications, records are kept for a specific amount of time, say 2 years. Hence, we merely keep track of the oldest epoch in any DRV, and ensure that wrapping epochs through zero does not overrun.

To deal with environments for which epochs cannot effectively wrap, we have little choice but to enlarge the “wrap length” of epochs or the size of an epoch.

6.2  Locking-based Concurrency Control

Read-write transactions use strict two-phase locking for concurrency control [GRAY92]. Each site sets locks on data objects that the runtime system reads or writes, thereby implementing a distributed lock table as in most distributed databases. Standard write-ahead logging is employed for recovery purposes; we use a NO-FORCE, STEAL policy [GRAY92] but differ from the traditional implementation of logging and locking in that we only log UNDO records, performing REDO as described in Section 6.3, and we do not use strict two-phase commit, avoiding the PREPARE phase as described in Section 6.2.1 below.

Locking can, of course, result in deadlock. We resolve deadlock via timeouts through the standard technique of aborting one of the deadlocked transactions.

6.2.1  Distributed COMMIT Processing

In C-Store, each transaction has a master that is responsible for assigning units of work corresponding to a transaction to the appropriate sites and determining the ultimate commit state of each transaction. The protocol differs from two-phase commit (2PC) in that no PREPARE messages are sent. When the master receives a COMMIT statement for the transaction, it waits until all workers have completed all outstanding actions and then issues a commit (or abort) message to each site. Once a site has received a commit message, it can release all locks related to the transaction and delete the UNDO log for the transaction. This protocol differs from 2PC because the master does not PREPARE the worker sites. This means it is possible for a site the master has told to commit to crash before writing any updates or log records related to a transaction to stable storage. In such cases, the failed site will recover its state, which will reflect updates from the committed transaction, from other projections on other sites in the system during recovery.

6.2.2  Transaction Rollback

When a transaction is aborted by the user or the C-Store system, it is undone by scanning backwards in the UNDO log, which contains one entry for each logical update to a segment. We use logical logging (as in ARIES [MOHA92]), since physical logging would result in many log records, due to the nature of the data structures in WS.

6.3  Recovery

As mentioned above, a crashed site recovers by running a query (copying state) from other projections. Recall that C-Store maintains K-safety; i.e. sufficient projections and join indexes are maintained, so that K sites can fail within t, the time to recover, and the system will be able to maintain transactional consistency. There are three cases to consider. If the failed site suffered no data loss, then we can bring it up to date by executing updates that will be queued for it elsewhere in the network. Since we anticipate read-mostly environments, this roll forward operation should not be onerous. Hence, recovery from the most common type of crash is straightforward. The second case to consider is a catastrophic failure which destroys both the RS and WS. In this case, we have no choice but to reconstruct both segments from other projections and join indexes in the system. The only needed functionality is the ability to retrieve auxiliary data structures (IV, DRV) from remote sites. After restoration, the queued updates must be run as above. The third case occurs if WS is damaged but RS is intact. Since RS is written only by the tuple mover, we expect it will typically escape damage. Hence, we discuss this common case in detail below.

6.3.1  Efficiently Recovering the WS

Consider a WS segment, Sr, of a projection with a sort key K and a key range R on a recovering site r along with a collection C of other projections, M1, …, Mb which contain the sort key of Sr. The tuple mover guarantees that each WS segment, S, contains all tuples with an insertion timestamp later than some time tlastmove(S), which represents the most recent insertion time of any record in S’s corresponding RS segment.

To recover, the recovering site first inspects every projection in C for a collection of columns that covers the key range K with each segment having tlastmove(S) ≤ tlastmove(Sr). If it succeeds, it can run a collection of queries of the form:

[image of recovery query omitted]

As long as the above queries return a storage key, other fields in the segment can be found by following appropriate join indexes. As long as there is a collection of segments that cover the key range of Sr, this technique will restore Sr to the current HWM. Executing queued updates will then complete the task.

On the other hand, if there is no cover with the desired property, then some of the tuples in Sr have already been moved to RS on the remote site. Although we can still query the remote site, it is challenging to identify the desired tuples without retrieving everything in RS and differencing against the local RS segment, which is obviously an expensive operation.

To efficiently handle this case, if it becomes common, we can force the tuple mover to log, for each tuple it moves, the storage key in RS that corresponds to the storage key and epoch number of the tuple before it was moved from WS. This log can be truncated to the timestamp of the oldest tuple still in the WS on any site, since no tuples before that will ever need to be recovered. In this case, the recovering site can use a remote WS segment, S, plus the tuple mover log to solve the query above, even though tlastmove(S) comes after tlastmove(Sr).

r处,我们还必须重建本地存储的任何连接索引的 WS 部分,即Sr是“发送者”。这仅仅需要查询远程“接收器”,然后它们可以在生成元组时计算连接索引,将连接索引的 WS 分区与恢复的列一起传输。

At r, we must also reconstruct the WS portion of any join indexes that are stored locally, i.e. for which Sr is a “sender.” This merely entails querying remote “receivers,” which can then compute the join index as they generate tuples, transferring the WS partition of the join index along with the recovered columns.

7  Tuple Mover

The job of the tuple mover is to move blocks of tuples in a WS segment to the corresponding RS segment, updating any join indexes in the process. It operates as a background task looking for worthy segment pairs. When it finds one, it performs a merge-out process, MOP, on this (RS, WS) segment pair.

MOP finds all records in the chosen WS segment with an insertion time at or before the LWM, and then divides them into two groups:

•  Ones deleted at or before LWM. These are discarded, because the user cannot run queries as of a time when they existed.

•  Ones that were not deleted, or deleted after LWM. These are moved to RS.

MOP will create a new RS segment that we name RS’. Then, it reads in blocks from columns of the RS segment, deletes any RS items with a value in the DRV less than or equal to the LWM, and merges in column values from WS. The merged data is then written out to the new RS’ segment, which grows as the merge progresses. The most recent insertion time of a record in RS’ becomes the segment’s new tlastmove and is always less than or equal to the LWM. This old-master/new-master approach will be more efficient than an update-in-place strategy, since essentially all data objects will move. Also, notice that records receive new storage keys in RS’, thereby requiring join index maintenance. Since RS items may also be deleted, maintenance of the DRV is also mandatory. Once RS’ contains all the WS data and join indexes are modified on RS’, the system cuts over from RS to RS’. The disk space used by the old RS can now be freed.
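Ignoring storage-key reassignment and join-index maintenance, the record-selection logic of MOP can be sketched as follows; the dict representations of the segments, IV, and DRV are simplifications for illustration:

```python
def merge_out(rs, ws, iv, drv, lwm):
    """Sketch of MOP: build RS' from the old RS plus the WS records
    inserted at or before the LWM, discarding anything deleted at or
    before the LWM. rs/ws map storage keys to records; iv/drv map
    storage keys to insertion/deletion epochs (0 = never deleted)."""
    new_rs = {sk: rec for sk, rec in rs.items()
              if not (0 < drv.get(sk, 0) <= lwm)}   # purge old RS deletes
    remaining_ws = {}
    for sk, rec in ws.items():
        if iv[sk] > lwm:
            remaining_ws[sk] = rec                  # too new to move yet
        elif 0 < drv.get(sk, 0) <= lwm:
            pass                                    # dead before LWM: discard
        else:
            new_rs[sk] = rec                        # merge into RS'
    return new_rs, remaining_ws
```

The old-master/new-master structure is visible here: RS' is built fresh and only swapped in once complete, rather than updating RS in place.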

Periodically the timestamp authority sends out to each site a new LWM epoch number. Hence, LWM “chases” HWM, and the delta between them is chosen to mediate between the needs of users who want historical access and the WS space constraints.

8 C-Store Query Execution

The query optimizer will accept a SQL query and construct a query plan of execution nodes. In this section, we describe the nodes that can appear in a plan and then the architecture of the optimizer itself.

8.1 Query Operators and Plan Format

There are 10 node types and each accepts operands or produces results of type projection (Proj), column (Col), or bitstring (Bits). A projection is simply a set of columns with the same cardinality and ordering. A bitstring is a list of zeros and ones indicating whether the associated values are present in the record subset being described. In addition, C-Store query operators accept predicates (Pred), join indexes (JI), attribute names (Att), and expressions (Exp) as arguments.

Join indexes and bitstrings are simply special types of columns. Thus, they also can be included in projections and used as inputs to operators where appropriate.

We briefly summarize each operator below.

  1.  Decompress converts a compressed column to an uncompressed (Type 4) representation.

  2.  Select is equivalent to the selection operator of the relational algebra (σ), but rather than producing a restriction of its input, instead produces a bitstring representation of the result.

  3.  Mask accepts a bitstring B and projection Cs, and restricts Cs by emitting only those values whose corresponding bits in B are 1.

  4.  Project is equivalent to the projection operator of the relational algebra (π).

  5.  Sort sorts all columns in a projection by some subset of those columns (the sort columns).

  6.  Aggregation Operators compute SQL-like aggregates over a named column, and for each group identified by the values in a projection.

  7.  Concat combines one or more projections sorted in the same order into a single projection.

  8.  Permute permutes a projection according to the ordering defined by a join index.

  9.  Join joins two projections according to a predicate that correlates them.

10.  Bitstring Operators BAnd produces the bitwise AND of two bitstrings. BOr produces a bitwise OR. BNot produces the complement of a bitstring.
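Several of these operators are easy to sketch over plain Python lists standing in for columns and bitstrings; the column contents below are invented for the example:

```python
def select(column, pred):
    """Like Select: emit a bitstring, not a restricted column."""
    return [1 if pred(v) else 0 for v in column]

def mask(bits, column):
    """Like Mask: emit only values whose corresponding bit is 1."""
    return [v for b, v in zip(bits, column) if b]

def band(b1, b2):
    """BAnd: bitwise AND of two bitstrings."""
    return [x & y for x, y in zip(b1, b2)]

def bor(b1, b2):
    """BOr: bitwise OR of two bitstrings."""
    return [x | y for x, y in zip(b1, b2)]

def bnot(bits):
    """BNot: complement of a bitstring."""
    return [1 - x for x in bits]
```

Keeping selection results as bitstrings lets a plan combine several predicates cheaply with BAnd/BOr before a single Mask materializes the surviving values.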

A C-Store query plan consists of a tree of the operators listed above, with access methods at the leaves and iterators serving as the interface between connected nodes. Each non-leaf plan node consumes the data produced by its children through a modified version of the standard iterator interface [GRAE93], via calls of “get_next.” To reduce communication overhead (i.e., the number of calls of “get_next”) between plan nodes, C-Store iterators return 64K blocks from a single column. This approach preserves the benefit of using iterators (coupling data flow with control flow), while changing the granularity of data flow to better match the column-based model.
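A minimal sketch of this block-at-a-time iteration (the block size is a parameter here; C-Store uses 64K blocks):

```python
def block_iterator(column, block_size=65536):
    """Yield blocks of values from a single column instead of one value
    per call, amortizing the per-call overhead of the iterator interface."""
    for i in range(0, len(column), block_size):
        yield column[i:i + block_size]

def get_next(it):
    """Pull the next block from a child node; None signals end-of-column."""
    return next(it, None)
```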

8.2  Query Optimization

We plan to use a Selinger-style [SELI79] optimizer that uses cost-based estimation for plan construction. We anticipate using a two-phase optimizer [HONG92] to limit the complexity of the plan search space. Note that query optimization in this setting differs from traditional query optimization in at least two respects: the need to consider compressed representations of data and the decisions about when to mask a projection using a bitstring.

C-Store operators have the capability to operate on both compressed and uncompressed input. As will be shown in Section 9, the ability to process compressed data is the key to the performance benefits of C-Store. An operator’s execution cost (both in terms of I/O and memory buffer requirements) is dependent on the compression type of the input. For example, a Select over Type 2 data (foreign order/few values, stored as delta-encoded bitmaps, with one bitmap per value) can be performed by reading only those bitmaps from disk whose values match the predicate (despite the column itself not being sorted). However, operators that take Type 2 data as input require much larger memory buffer space (one page of memory for each possible value in the column) than any of the other three types of compression. Thus, the cost model must be sensitive to the representations of input and output columns.
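For intuition, a stripped-down Type 2 representation (one uncompressed bitmap per distinct value; the delta encoding of the bitmaps is omitted) shows why an equality Select needs to touch only one bitmap:

```python
def type2_encode(column):
    """Simplified Type 2: a bitmap per distinct value in the column."""
    bitmaps = {}
    for i, v in enumerate(column):
        bitmaps.setdefault(v, [0] * len(column))
        bitmaps[v][i] = 1
    return bitmaps

def select_eq(bitmaps, value, n):
    """Select v == value directly on the compressed form: only the one
    matching bitmap is read; every other bitmap stays on disk."""
    return bitmaps.get(value, [0] * n)
```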

The major optimizer decision is which set of projections to use for a given query. Obviously, it will be time consuming to construct a plan for each possibility, and then select the best one. Our focus will be on pruning this search space. In addition, the optimizer must decide where in the plan to mask a projection according to a bitstring. For example, in some cases it is desirable to push the Mask early in the plan (e.g., to avoid producing a bitstring while performing selection over Type 2 compressed data) while in other cases it is best to delay masking until a point where it is possible to feed a bitstring to the next operator in the plan (e.g., COUNT) that can produce results solely by processing the bitstring.

9 Performance Comparison

At the present time, we have a storage engine and the executor for RS running. We have an early implementation of the WS and tuple mover; however, they are not at the point where we can run experiments on them. Hence, our performance analysis is limited to read-only queries, and we are not yet in a position to report on updates. Moreover, RS does not yet support segments or multiple grid nodes. As such, we report single-site numbers. A more comprehensive performance study will be done once the other pieces of the system have been built.

Our benchmarking system is a 3.0 GHz Pentium, running RedHat Linux, with 2 Gbytes of memory and 750 Gbytes of disk.

In the decision support (warehouse) market TPC-H is the gold standard, and we use a simplified version of this benchmark, which our current engine is capable of running. Specifically, we implement the lineitem, order, and customer tables as follows:

[image of table definitions omitted]

We chose columns of type INTEGER and CHAR(1) to simplify the implementation. The standard data for the above table schema for TPC-H scale_10 totals 60,000,000 line items (1.8GB), and was generated by the data generator available from the TPC website.

We tested three systems and gave each of them a storage budget of 2.7 GB (roughly 1.5 times the raw data size) for all data plus indices. The three systems were C-Store as described above and two popular commercial relational DBMS systems, one that implements a row store and another that implements a column store. In both of these systems, we turned off locking and logging. We designed the schemas for the three systems in a way to achieve the best possible performance given the above storage budget. The row store was unable to operate within the space constraint, so we gave it 4.5 GB, which is what it needed to store its tables plus indices. The actual disk usage numbers are shown below. Obviously, C-Store uses 40% of the space of the row store, even though it uses redundancy and the row store does not. The main reasons are C-Store compression and absence of padding to word or block boundaries. The column store requires 30% more space than C-Store. Again, C-Store can store a redundant schema in less space because of superior compression and absence of padding.

C-Store     Row Store    Column Store
1.987 GB    4.480 GB     2.650 GB

We ran the following seven queries on each system:

Q1. Determine the total number of lineitems shipped for each day after day D.

Q2. Determine the total number of lineitems shipped for each supplier on day D.

Q3. Determine the total number of lineitems shipped for each supplier after day D.

Q4. For every day after D, determine the latest shipdate of all items ordered on that day.

Q5. For each supplier, determine the latest shipdate of an item from an order that was made on some date, D.

Q6. For each supplier, determine the latest shipdate of an item from an order made after some date, D.

Q7. Return a list of identifiers for all nations represented by customers along with their total lost revenue for the parts they have returned. This is a simplified version of query 10 (Q10) of TPC-H.
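
The SQL for the seven queries is not reproduced here. As a concrete illustration, the sketch below runs Q1-style and Q7-style queries over a toy sqlite3 database; the table layout, column names, and data are invented for the example and are not the paper's benchmark schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE lineitem (l_orderkey INTEGER, l_shipdate INTEGER,
                       l_returnflag CHAR(1), l_extendedprice INTEGER);
CREATE TABLE orders   (o_orderkey INTEGER, o_custkey INTEGER);
CREATE TABLE customer (c_custkey INTEGER, c_nationkey INTEGER);
INSERT INTO lineitem VALUES (1, 5, 'R', 100), (1, 6, 'N', 50),
                            (2, 6, 'R', 70),  (2, 7, 'R', 30);
INSERT INTO orders   VALUES (1, 10), (2, 11);
INSERT INTO customer VALUES (10, 3), (11, 4);
""")

D = 5
# Q1-style: total number of lineitems shipped for each day after day D.
q1 = conn.execute("""
    SELECT l_shipdate, COUNT(*) FROM lineitem
    WHERE l_shipdate > ? GROUP BY l_shipdate ORDER BY l_shipdate
""", (D,)).fetchall()

# Q7-style: per-nation lost revenue for returned parts (simplified Q10).
q7 = conn.execute("""
    SELECT c_nationkey, SUM(l_extendedprice)
    FROM lineitem JOIN orders   ON l_orderkey = o_orderkey
                  JOIN customer ON o_custkey  = c_custkey
    WHERE l_returnflag = 'R'
    GROUP BY c_nationkey ORDER BY c_nationkey
""").fetchall()

print(q1)  # [(6, 2), (7, 1)]
print(q7)  # [(3, 100), (4, 100)]
```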

We constructed schemas for each of the three systems that best matched our seven-query workload. These schemas were tuned individually for the capabilities of each system. For C-Store, we used the following schema:

[C-Store schema (projections D1–D5) omitted.]

D2 and D4 are materialized (join) views. D3 and D5 are added for completeness since we don’t use them in any of the seven queries. They are included so that we can answer arbitrary queries on this schema as is true for the product schemas.

On the commercial row-store DBMS, we used the common relational schema given above with a collection of system-specific tuning parameters. We also used system-specific tuning parameters for the commercial column-store DBMS. Although we believe we chose good values for the commercial systems, obviously, we cannot guarantee they are optimal.

The following table indicates the performance that we observed. All measurements are in seconds and are taken on a dedicated machine.

[Performance table omitted.]

As can be seen, C-Store is much faster than either commercial product. The main reasons are:

•  Column representation – avoids reads of unused attributes (same as competing column store).

•  Storing overlapping projections, rather than the whole table – allows storage of multiple orderings of a column as appropriate.

•  Better compression of data – allows more orderings in the same space.

•  Query operators operate on compressed representation – mitigates the storage barrier problem of current processors.
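
The last bullet can be made concrete with run-length encoding, the simplest of the codings C-Store uses: an aggregate can be computed from (value, run-length) pairs without ever materializing the decompressed column. A minimal sketch, not C-Store's actual implementation:

```python
# Sum and count over a run-length-encoded column without decompressing
# it. Each pair is (value, run_length); this is loosely analogous to
# C-Store's run-length codings in spirit only -- the real engine
# operates on compressed disk blocks.
rle_column = [(7, 1000), (9, 500), (7, 250)]

total = sum(value * length for value, length in rle_column)
count = sum(length for _, length in rle_column)

print(total, count)  # 13250 1750
```

Because three pairs stand in for 1750 values, the aggregate touches far less data than a scan of the decompressed column would.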

In order to give the other systems every possible advantage, we tried running them with the materialized views that correspond to the projections we used with C-Store. This time, the systems used space as follows (C-Store numbers, which did not change, are included as a reference):

C-Store     Row Store    Column Store
1.987 GB    11.900 GB    4.090 GB

The relative performance numbers in seconds are as follows:

[Performance table omitted.]

As can be seen, the performance gap closes, but at the same time, the amount of storage needed by the two commercial systems grows quite large.

In summary, for this seven query benchmark, C-Store is on average 164 times faster than the commercial row-store and 21 times faster than the commercial column-store in the space-constrained case. For the case of unconstrained space, C-Store is 6.4 times faster than the commercial row-store, but the row-store takes 6 times the space. C-Store is on average 16.5 times faster than the commercial column-store, but the column-store requires 1.83 times the space.

Of course, this performance data is very preliminary. Once we get WS running and write a tuple mover, we will be in a better position to do an exhaustive study.

10  Related Work

One of the thrusts in the warehouse market is in maintaining so-called “data cubes.” This work dates from Essbase by Arbor software in the early 1990’s, which was effective at “slicing and dicing” large data sets [GRAY97]. Efficiently building and maintaining specific aggregates on stored data sets has been widely studied [KOTI99, ZHAO97]. Precomputation of such aggregates as well as more general materialized views [STAU96] is especially effective when a prespecified set of queries is run at regular intervals. On the other hand, when the workload cannot be anticipated in advance, it is difficult to decide what to precompute. C-Store is aimed entirely at this latter problem.

Including two differently architected DBMSs in a single system has been studied before in data mirrors [RAMA02]. However, the goal of data mirrors was to achieve better query performance than could be achieved by either of the two underlying systems alone in a warehouse environment. In contrast, our goal is to simultaneously achieve good performance on update workloads and ad-hoc queries. Consequently, C-Store differs dramatically from a data mirror in its design.

Storing data via columns has been implemented in several systems, including Sybase IQ, Addamark, Bubba [COPE88], Monet [BONC04], and KDB. Of these, Monet is probably closest to C-Store in design philosophy. However, these systems typically store data in entry sequence and do not have our hybrid architecture nor do they have our model of overlapping materialized projections.

Similarly, storing tables using an inverted organization is well known. Here, every attribute is stored using some sort of indexing, and record identifiers are used to find corresponding attributes in other columns. C-Store uses this sort of organization in WS but extends the architecture with RS and a tuple mover.

There has been substantial work on using compressed data in databases; Roth and Van Horn [ROTH93] provide an excellent summary of many of the techniques that have been developed. Our coding schemes are similar to some of these techniques, all of which are derived from a long history of work on the topic in the broader field of computer science [WITT87]. Our observation that it is possible to operate directly on compressed data has been made before [GRAE91, WESM00].

Lastly, materialized views, snapshot isolation, transaction management, and high availability have also been extensively studied. The contribution of C-Store is an innovative combination of these techniques that simultaneously provides improved performance, K-safety, efficient retrieval, and high performance transactions.

11  Conclusions

This paper has presented the design of C-Store, a radical departure from the architecture of current DBMSs. Unlike current commercial systems, it is aimed at the “read-mostly” DBMS market. The innovative contributions embodied in C-Store include:

•  A column store representation, with an associated query execution engine.

•  A hybrid architecture that allows transactions on a column store.

•  A focus on economizing the storage representation on disk, by coding data values and dense-packing the data.

•  A data model consisting of overlapping projections of tables, unlike the standard fare of tables, secondary indexes, and projections.

•  A design optimized for a shared nothing machine environment.

•  Distributed transactions without a redo log or two phase commit.

•  Efficient snapshot isolation.

Acknowledgements and References

We would like to thank David DeWitt for his helpful feedback and ideas.

This work was supported by the National Science Foundation under NSF Grant numbers IIS-0086057 and IIS-0325525.

[ADDA04] http://www.addamark.com/products/sls.htm

[BERE95] Hal Berenson et al. A Critique of ANSI SQL Isolation Levels. In Proceedings of SIGMOD, 1995.

[BONC04] Peter Boncz et al. MonetDB/X100: Hyper-Pipelining Query Execution. In Proceedings of CIDR, 2004.

[CERI91] S. Ceri and J. Widom. Deriving Production Rules for Incremental View Maintenance. In VLDB, 1991.

[COPE88] George Copeland et al. Data Placement in Bubba. In Proceedings of SIGMOD, 1988.

[DEWI90] David DeWitt et al. The GAMMA Database Machine Project. IEEE Transactions on Knowledge and Data Engineering, 2(1), March 1990.

[DEWI92] David DeWitt and Jim Gray. Parallel Database Systems: The Future of High Performance Database Processing. Communications of the ACM, 1992.

[FREN95] Clark D. French. One Size Fits All Database Architectures Do Not Work for DSS. In Proceedings of SIGMOD, 1995.

[GRAE91] Goetz Graefe and Leonard D. Shapiro. Data Compression and Database Performance. In Proceedings of the Symposium on Applied Computing, 1991.

[GRAE93] G. Graefe. Query Evaluation Techniques for Large Databases. Computing Surveys, 25(2), 1993.

[GRAY92] Jim Gray and Andreas Reuter. Transaction Processing: Concepts and Techniques. Morgan Kaufmann, 1992.

[GRAY97] Gray et al. Data Cube: A Relational Aggregation Operator Generalizing Group-By, Cross-Tab, and Sub-Totals. Data Mining and Knowledge Discovery, 1(1), 1997.

[HONG92] Wei Hong and Michael Stonebraker. Exploiting Interoperator Parallelism in XPRS. In SIGMOD, 1992.

[KDB04] http://www.kx.com/products/database.php

[KOTI99] Yannis Kotidis and Nick Roussopoulos. DynaMat: A Dynamic View Management System for Data Warehouses. In Proceedings of SIGMOD, 1999.

[MOHA92] C. Mohan et al. ARIES: A Transaction Recovery Method Supporting Fine-Granularity Locking and Partial Rollbacks Using Write-Ahead Logging. TODS, March 1992.

[ONEI96] Patrick O'Neil, Edward Cheng, Dieter Gawlick, and Elizabeth O'Neil. The Log-Structured Merge-Tree. Acta Informatica, 33, June 1996.

[ONEI97] P. O'Neil and D. Quass. Improved Query Performance with Variant Indexes. In Proceedings of SIGMOD, 1997.

[ORAC04] Oracle Corporation. Oracle 9i Database for Data Warehousing and Business Intelligence. White Paper. http://www.oracle.com/solutions/business_intelligence/Oracle9idw_bwp.

[PAPA04] Stratos Papadomanolakis and Anastassia Ailamaki. AutoPart: Automating Schema Design for Large Scientific Databases Using Data Partitioning. In SSDBM, 2004.

[RAMA02] Ravishankar Ramamurthy, David DeWitt, and Qi Su. A Case for Fractured Mirrors. In Proceedings of VLDB, 2002.

[ROTH93] Mark A. Roth and Scott J. Van Horn. Database Compression. SIGMOD Record, 22(3), 1993.

[SELI79] Patricia Selinger, Morton Astrahan, Donald Chamberlin, Raymond Lorie, and Thomas Price. Access Path Selection in a Relational Database. In Proceedings of SIGMOD, 1979.

[SLEE04] http://www.sleepycat.com/docs/

[STAU96] Martin Staudt and Matthias Jarke. Incremental Maintenance of Externally Materialized Views. In VLDB, 1996.

[STON86] Michael Stonebraker. The Case for Shared Nothing. Database Engineering, 9(1), 1986.

[SYBA04] http://www.sybase.com/products/databaseservers/sybaseiq

[TAND89] Tandem Database Group. NonStop SQL: A Distributed, High-Performance, High-Availability Implementation of SQL. In Proceedings of HPTPS, 1989.

[VSAM04] http://www.redbooks.ibm.com/redbooks.nsf/0/8280b48d5e3997bf85256cbd007e4a96?OpenDocument

[WESM00] Till Westmann, Donald Kossmann, Sven Helmer, and Guido Moerkotte. The Implementation and Performance of Compressed Databases. SIGMOD Record, 29(3), 2000.

[WEST00] Paul Westerman. Data Warehousing: Using the Wal-Mart Model. Morgan Kaufmann Publishers, 2000.

[WITT87] I. Witten, R. Neal, and J. Cleary. Arithmetic Coding for Data Compression. Communications of the ACM, 30(6), June 1987.

[ZHAO97] Y. Zhao, P. Deshpande, and J. Naughton. An Array-Based Algorithm for Simultaneous Multidimensional Aggregates. In Proceedings of SIGMOD, 1997.

Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the VLDB copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Very Large Data Base Endowment. To copy otherwise, or to republish, requires a fee and/or special permission from the Endowment.

Proceedings of the 31st VLDB Conference, Trondheim, Norway, 2005.

The Implementation of POSTGRES

Michael Stonebraker, Lawrence A. Rowe, Michael Hirohama

Abstract—Currently, POSTGRES is about 90 000 lines of code in C and is being used by assorted “bold and brave” early users. The system has been constructed by a team of five part-time students led by a full-time chief programmer over the last three years. During this period, we have made a large number of design and implementation choices. Moreover, in some areas we would do things quite differently if we were to start from scratch again. The purpose of this paper is to reflect on the design and implementation decisions we made and to offer advice to implementors who might follow some of our paths. In this paper, we restrict our attention to the DBMS “backend” functions. In another paper, some of us treat Picasso, the application development environment that is being built on top of POSTGRES.

Index Terms—Extensible databases, next-generation DBMS’s, no-overwrite storage managers, object-oriented databases, rule systems.

I  Introduction

Current relational DBMS’s are oriented toward efficient support for business data processing applications where large numbers of instances of fixed format records must be stored and accessed. Traditional transaction management and query facilities for this application area will be termed data management.

To satisfy the broader application community outside of business applications, DBMS’s will have to expand to offer services in two other dimensions, namely object management and knowledge management. Object management entails efficiently storing and manipulating nontraditional data types such as bitmaps, icons, text, and polygons. Object management problems abound in CAD and many other engineering applications. Object-oriented programming languages and databases provide services in this area.

Knowledge management entails the ability to store and enforce a collection of rules that are part of the semantics of an application. Such rules describe integrity constraints about the application, as well as allowing the derivation of data that are not directly stored in the database.

We now indicate a simple example which requires services in all three dimensions. Consider an application that stores and manipulates text and graphics to facilitate the layout of newspaper copy. Such a system will be naturally integrated with subscription and classified advertisement data. Billing customers for these services will require traditional data management services. In addition, this application must store nontraditional objects including text, bitmaps (pictures), and icons (the banner across the top of the paper). Hence, object management services are required. Lastly, there are many rules that control newspaper layout. For example, the ad copy for two major department stores can never be on facing pages. Support for such rules is desirable in this application.

We believe that most real world data management problems are three dimensional. Like the newspaper application, they will require a three-dimensional solution. The fundamental goal of POSTGRES [26], [35] is to provide support for such three-dimensional applications. To the best of our knowledge it is the first three-dimensional data manager. However, we expect that most DBMS’s will follow the lead of POSTGRES into these new dimensions.

To accomplish this objective, object and rule management capabilities were added to the services found in a traditional data manager. In the next two sections, we describe the capabilities provided and comment on our implementation decisions. Then, in Section IV we discuss the novel no-overwrite storage manager that we implemented in POSTGRES. Other papers have explained the major POSTGRES design decisions in these areas, and we assume that the reader is familiar with [21] on the data model, [30] on rule management, and [28] on storage management. Hence, in these three sections we stress considerations that led to our design, what we liked about the design, and the mistakes that we felt we made. Where appropriate we make suggestions for future implementors based on our experience.

Section V of the paper comments on specific issues in the implementation of POSTGRES and critiques the choices that we made. In this section, we discuss how we interfaced to the operating system, our choice of programming languages, and some of our implementation philosophy.

The final section concludes with some performance measurements of POSTGRES. Specifically, we report the results of some of the queries in the Wisconsin benchmark [7].

II  The POSTGRES Data Model and Query Language

II.A  Introduction

Traditional relational DBMS’s support a data model consisting of a collection of named relations, each attribute of which has a specific type. In current commercial systems, possible types are floating point numbers, integers, character strings, and dates. It is commonly recognized that this data model is insufficient for nonbusiness data processing applications. In designing a new data model and query language, we were guided by the following three design criteria.

1) Orientation toward database access from a query language: We expect POSTGRES users to interact with their databases primarily by using the set-oriented query language, POSTQUEL. Hence, inclusion of a query language, an optimizer, and the corresponding run-time system was a primary design goal.

It is also possible to interact with a POSTGRES database by utilizing a navigational interface. Such interfaces were popularized by the CODASYL proposals of the 1970's and are enjoying a renaissance in recent object-oriented proposals such as ORION [6] or O2 [34]. Because POSTGRES gives each record a unique identifier (OID), it is possible to use the identifier for one record as a data item in a second record. Using optionally definable indexes on OID's, it is then possible to navigate from one record to the next by running one query per navigation step. In addition, POSTGRES allows a user to define functions (methods) to the DBMS. Such functions can intersperse statements in a programming language, a query language, and direct calls to internal POSTGRES interfaces. The ability to directly execute functions, which we call fast path, is provided in POSTGRES and allows a user to navigate the database by executing a sequence of functions.
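
As a loose illustration of this navigation style, the sketch below models a record store keyed by OID, where one record's field holds another record's OID and each navigation step is one lookup; the record layout and helper are invented for the example, not POSTGRES interfaces:

```python
# Toy stand-in for OID-based navigation: every record has a unique OID,
# and a field of one record may hold the OID of another. Following a
# reference then costs one lookup ("one query") per navigation step.
store = {
    101: {"type": "DEPT", "dname": "shoe"},
    202: {"type": "EMP", "name": "Smith", "dept": 101},  # dept holds an OID
}

def lookup(oid):
    """One navigation step: fetch the record with the given OID."""
    return store[oid]

emp = lookup(202)
dept = lookup(emp["dept"])   # navigate EMP -> DEPT via the stored OID
print(dept["dname"])         # shoe
```

The application code, not the optimizer, decides the access path at each step, which is exactly the CODASYL-era drawback the next paragraph points out.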

However, we do not expect this sort of mechanism to become popular. All navigational interfaces have the same disadvantages of CODASYL systems, namely the application programmer must construct a query plan for each task he wants to accomplish and substantial application maintenance is required whenever the schema changes.

2) Orientation toward multilingual access: We could have picked our favorite programming language and then tightly coupled POSTGRES to the compiler and run-time environment of that language. Such an approach would offer persistence for variables in this programming language, as well as a query language integrated with the control statements of the language. This approach has been followed in ODE [1] and many of the recent commercial startups doing object-oriented databases.

Our point of view is that most databases are accessed by programs written in several different languages, and we do not see any programming language Esperanto on the horizon. Therefore, most application development organizations are multilingual and require access to a database from different languages. In addition, database application packages that a user might acquire, for example to perform statistical or spreadsheet services, are often not coded in the language being used for developing applications. Again, this results in a multilingual environment.

Hence, POSTGRES is programming language neutral, that is, it can be called from many different languages. Tight integration of POSTGRES to a particular language requires compiler extensions and a run-time system specific to that programming language. One of us has built an implementation of persistent CLOS (Common LISP Object System) on top of POSTGRES. Persistent CLOS (or persistent X for any programming language X) is inevitably language specific. The run-time system must map the disk representation for language objects, including pointers, into the main memory representation expected by the language. Moreover, an object cache must be maintained in the program address space, or performance will suffer badly. Both tasks are inherently language specific.

We expect many language specific interfaces to be built for POSTGRES and believe that the query language plus the fast path interface available in POSTGRES offer a powerful, convenient abstraction against which to build these programming language interfaces.

3) Small number of concepts: We tried to build a data model with as few concepts as possible. The relational model succeeded in replacing previous data models in part because of its simplicity. We wanted to have as few concepts as possible so that users would have minimum complexity to contend with. Hence, POSTGRES leverages the following three constructs:

types

functions

inheritance.

In the next subsection, we briefly review the POSTGRES data model. Then, we turn to a short description of POSTQUEL and fast path. We conclude the section with a discussion of whether POSTGRES is object-oriented followed by a critique of our data model and query language.

II.B  The POSTGRES Data Model

As mentioned in the previous section, POSTGRES leverages types and functions as fundamental constructs. There are three kinds of types in POSTGRES and three kinds of functions and we discuss the six possibilities in this section.

Some researchers, e.g., [27], [19], have argued that one should be able to construct new base types such as bits, bit-strings, encoded character strings, bitmaps, compressed integers, packed decimal numbers, radix 50 decimal numbers, money, etc. Unlike most next-generation DBMS’s which have a hardwired collection of base types (typically integers, floats, and character strings), POSTGRES contains an abstract data type facility whereby any user can construct an arbitrary number of new base types. Such types can be added to the system while it is executing and require the defining user to specify functions to convert instances of the type to and from the character string data type. Details of the syntax appear in [35].

The second kind of type available in POSTGRES is a constructed type.1 A user can create a new type by constructing a record of base types and instances of other constructed types. For example,

create DEPT (dname = c10, floor = integer, floor-space = polygon)

create EMP (name = c12, dept = DEPT, salary = float)

Here, DEPT is a type constructed from an instance of each of three base types: a character string, an integer, and a polygon. EMP, on the other hand, is fabricated from base types and other constructed types.

A constructed type can optionally inherit data elements from other constructed types. For example, a SALESMAN type can be created as follows:

create SALESMAN (quota = float) inherits (EMP)

In this case, an instance of SALESMAN has a quota and inherits all data elements from EMP, namely name, dept, and salary. We had the standard discussion about whether to include single or multiple inheritance and concluded that a single inheritance scheme would simply be too restrictive. As a result, POSTGRES allows a constructed type to inherit from an arbitrary collection of other constructed types.

When ambiguities arise because an object has multiple parents with the same field name, we elected to refuse to create the new type. However, we isolated the resolution semantics in a single routine, which can be easily changed to track multiple inheritance semantics as they unfold over time in programming languages.
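
The resolution policy just described can be sketched as follows; this is an illustrative analogy in Python, not the isolated POSTGRES routine itself:

```python
def create_type(name, fields, parents=()):
    """Combine a type's own fields with inherited ones. Refuse creation
    when two parents define the same field name, mirroring POSTGRES's
    choice to reject ambiguous multiple inheritance. `parents` are
    dicts (field name -> field type) of already-created types."""
    inherited = {}
    for parent in parents:
        for field in parent:
            if field in inherited:
                raise ValueError(f"ambiguous inherited field: {field}")
            inherited[field] = parent[field]
    return {**inherited, **fields}

EMP = create_type("EMP", {"name": "c12", "dept": "DEPT", "salary": "float"})
SALESMAN = create_type("SALESMAN", {"quota": "float"}, parents=[EMP])
print(sorted(SALESMAN))  # ['dept', 'name', 'quota', 'salary']
```

Because the conflict check lives in one routine, swapping in a different resolution rule (say, leftmost-parent-wins) means changing only that function, which is the isolation the paper describes.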

We now turn to the POSTGRES notion of functions. There are three different classes of POSTGRES functions:

normal functions

operators

POSTQUEL functions

and we discuss each in turn.

A user can define an arbitrary collection of normal functions whose operands are base types or constructed types. For example, he can define a function, area, which maps an instance of a polygon into an instance of a floating point number. Such functions are automatically available in the query language as illustrated in the following example:

retrieve (DEPT.dname)

 where area (DEPT.floorspace) > 500

Normal functions can be defined to POSTGRES while the system is running and are dynamically loaded when required during query execution.

Functions are allowed on constructed types, e.g.,

retrieve (EMP.name) where overpaid (EMP)

In this case, overpaid has an operand of type EMP and returns a Boolean. Functions whose operands are constructed types are inherited down the type hierarchy in the standard way.

Normal functions are arbitrary procedures written in a general purpose programming language (in our case C or LISP). Hence, they have arbitrary semantics and can run other POSTQUEL commands during execution. Therefore, queries with normal functions in the qualification cannot be optimized by the POSTGRES query optimizer. For example, the above query on overpaid employees will result in a sequential scan of all employees.

To utilize indexes in processing queries, POSTGRES supports a second class of functions, called operators. Operators are functions with one or two operands which use the standard operator notation in the query language. For example, the following query looks for departments whose floor space has a greater area than that of a specific polygon:

retrieve (DEPT.dname) where DEPT.floorspace AGT

 polygon[“(0,0), (1,1), (0,2)”].

The “area greater than” operator AGT is defined by indicating the token to use in the query language as well as the function to call to evaluate the operator. Moreover, several hints can also be included in the definition which assist the query optimizer. One of these hints is that ALE is the negator of this operator. Therefore, the query optimizer can transform the query:

retrieve (DEPT.dname) where not (DEPT.floorspace

 ALE polygon[ “(0,0), (1,1), (0,2)”])

which cannot be optimized into the one above which can be optimized. In addition, the design of the POSTGRES access methods allows a B+-tree index to be constructed for the instances of floorspace appearing in DEPT records. This index can support efficient access for the class of operators {ALT, ALE, AE, AGT, AGE}. Information on the access paths available to the various operators is recorded in the POSTGRES system catalogs.
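
The effect of the negator hint can be sketched as follows (a hypothetical Python illustration; the tuple-based qualification format and the `simplify` routine are invented, while the operator names come from the text):

```python
# Hypothetical sketch: a negator hint lets the optimizer rewrite
# not (x ALE y) into x AGT y, producing an indexable qualification.

NEGATOR = {"ALE": "AGT", "AGT": "ALE", "ALT": "AGE", "AGE": "ALT"}

def simplify(qual):
    """qual is either ('not', (lhs, op, rhs)) or a plain (lhs, op, rhs)."""
    if qual[0] == "not":
        lhs, op, rhs = qual[1]
        if op in NEGATOR:              # hint available: drop the negation
            return (lhs, NEGATOR[op], rhs)
    return qual

q = ("not", ("DEPT.floorspace", "ALE", 'polygon["(0,0), (1,1), (0,2)"]'))
```

After the rewrite, the predicate is a plain comparison that the B+-tree index on floorspace can serve.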

As pointed out in [29], it is imperative that a user be able to construct new access methods to provide efficient access to instances of nontraditional base types. For example, suppose a user introduces a new operator “!!” defined on polygons that returns true if two polygons overlap. Then, he might ask a query such as

retrieve (DEPT.dname) where DEPT.floorspace !!

 polygon[“(0,0), (1,1), (0,2)”]

There is no B+-tree or hash access method that will allow this query to be rapidly executed. Rather, the query must be supported by some multidimensional access method such as R-trees, grid files, K-D-B trees, etc. Hence, POSTGRES was designed to allow new access methods to be written by POSTGRES users and then dynamically added to the system. Basically, an access method to POSTGRES is a collection of 13 normal functions which perform record level operations such as fetching the next record in a scan, inserting a new record, deleting a specific record, etc. All a user need do is define implementations for each of these functions and make a collection of entries in the system catalogs.
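
The shape of this interface can be sketched as follows (a hypothetical Python illustration; the catalog dictionary, routine names, and the abridged four-routine subset are invented stand-ins for the 13 functions):

```python
# Hypothetical sketch: an access method as a named bundle of record-level
# routines registered in the system catalogs. Only an invented, abridged
# subset of the 13 functions is shown.

ACCESS_METHODS = {}  # stand-in for the POSTGRES system catalogs

def register_access_method(name, routines):
    required = {"open", "getnext", "insert", "delete"}  # abridged from 13
    missing = required - routines.keys()
    if missing:
        raise ValueError(f"{name}: missing routines {sorted(missing)}")
    ACCESS_METHODS[name] = routines

heap = []
register_access_method("seqscan", {
    "open":    lambda: iter(list(heap)),      # snapshot scan over the heap
    "getnext": lambda scan: next(scan, None), # None signals end of scan
    "insert":  lambda rec: heap.append(rec),
    "delete":  lambda rec: heap.remove(rec),
})

am = ACCESS_METHODS["seqscan"]
am["insert"]({"dname": "shoe", "floor": 1})
scan = am["open"]()
```

The executor would look up the registered routines through the catalog entry rather than calling a hardwired access method.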

Operators are only available for operands which are base types because access methods traditionally support fast access to specific fields in records. It is unclear what an access method for a constructed type should do, and therefore POSTGRES does not include this capability.

The third kind of function available in POSTGRES is POSTQUEL functions. Any collection of commands in the POSTQUEL query language can be packaged together and defined as a function. For example, the following function defines the overpaid employees:

define function high-pay as retrieve (EMP.all) where

 EMP.salary > 50000

POSTQUEL functions can also have parameters, for example,

define function ret-sal as retrieve (EMP.salary) where

 EMP.name = $1

Notice that ret-sal has one parameter in the body of the function, the name of the person involved. Such parameters must be provided at the time the function is called. A third example POSTQUEL function is

define function set-of-DEPT as retrieve (DEPT.all)

 where DEPT.floor = $.floor

This function has a single parameter “$.floor.” It is expected to appear in a record and receives the value of its parameter from the floor field defined elsewhere in the same record.

Each POSTQUEL function is automatically a constructed type. For example, one can define a FLOORS type as follows:

create FLOORS (floor = i2, depts = set-of-DEPT)

This constructed type uses the set-of-DEPT function as a constructed type. In this case, each instance of FLOORS has a value for depts which is the value of the function set-of-DEPT for that record.

In addition, POSTGRES allows a user to form a constructed type, one or more of whose fields has the special type POSTQUEL. For example, a user can construct the following type:

create PERSON (name = c12, hobbies = POSTQUEL)

In this case, each instance of hobbies contains a different POSTQUEL function, and therefore each person has a name and a POSTQUEL function that defines his particular hobbies. This support for POSTQUEL as a type allows the system to simulate nonnormalized relations as found in NF**2 [11].

POSTQUEL functions can appear in the query language in the same manner as normal functions. The following example gives Sam the same salary as Joe:

replace EMP (salary = ret-sal(“Joe”)) where

EMP.name = “Sam”

In addition, since POSTQUEL functions are a constructed type, queries can be executed against POSTQUEL functions just like other constructed types. For example, the following query can be run on the constructed type, high-pay:

retrieve (high-pay.salary) where high-pay.name =

 “george”

If a POSTQUEL function contains a single retrieve command, then it is very similar to a relational view definition, and this capability allows retrieval operations to be performed on objects which are essentially relational views.

Lastly, every time a user defines a constructed type, a POSTQUEL function is automatically defined with the same name. For example, when DEPT is constructed, the following function is automatically defined:

define function DEPT as retrieve (DEPT.all) where DEPT.OID = $1

When EMP was defined earlier in this section, it contained a field dept which was of type DEPT. In fact, DEPT was the above automatically defined POSTQUEL function. As a result, an instance of a constructed type is available as a type because POSTGRES automatically defines a POSTQUEL function for each such type.

POSTQUEL functions are a very powerful notion because they allow arbitrary collections of instances of types to be returned as the value of the function. Since POSTQUEL functions can reference other POSTQUEL functions, arbitrary structures of complex objects can be assembled. Lastly, POSTQUEL functions allow collections of commands such as the five SQL commands that make up TP1 [3] to be assembled into a single function and stored inside the DBMS. Then, one can execute TP1 by executing the single function. This approach is preferred to having to submit the five SQL commands in TP1 one by one from an application program. Using a POSTQUEL function, one replaces five roundtrips between the application and the DBMS with 1, which results in a 25% performance improvement in a typical OLTP application.

II.C  The POSTGRES Query Language

The previous section presented several examples of the POSTQUEL language. It is a set-oriented query language that resembles a superset of a relational query language. Besides user-defined functions and operators which were illustrated earlier, the features which have been added to a traditional relational language include

path expressions

support for nested queries

transitive closure

support for inheritance

support for time travel.

Path expressions are included because POSTQUEL allows constructed types which contain other constructed types to be hierarchically referenced. For example, the EMP type defined above contains a field which is an instance of the constructed type, DEPT. Hence, one can ask for the names of employees who work on the first floor as follows:

retrieve (EMP.name) where EMP.dept.floor = 1

rather than being forced to do a join, e.g.,

retrieve (EMP.name) where EMP.dept = DEPT.OID

 and DEPT.floor = 1
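
The difference between the two formulations can be sketched as follows (a hypothetical Python illustration; the sample records and the `path` helper are invented):

```python
# Hypothetical sketch: evaluating the path expression EMP.dept.floor by
# following the nested constructed-type instance instead of joining.

DEPT_SHOE = {"dname": "shoe", "floor": 1}
DEPT_TOY  = {"dname": "toy",  "floor": 2}
EMP = [
    {"name": "Smith", "dept": DEPT_SHOE},
    {"name": "Jones", "dept": DEPT_TOY},
]

def path(record, expr):
    """Follow a dotted path such as 'dept.floor' through nested records."""
    for field in expr.split("."):
        record = record[field]
    return record

first_floor = [e["name"] for e in EMP if path(e, "dept.floor") == 1]
```

The dept field already holds the department instance, so no OID-matching join is needed to reach floor.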

POSTQUEL also allows queries to be nested and has operators that have sets of instances as operands. For example to find the departments which occupy an entire floor, one would query

retrieve (DEPT.dname)

where DEPT.floor NOTIN {D.floor using D in DEPT

 where D.dname != DEPT.dname}

In this case, the expression inside the curly braces represents a set of instances and NOTIN is an operator which takes a set of instances as its right operand.

The transitive closure operation allows one to explode a parts or ancestor hierarchy. Consider for example the constructed type

parent (older, younger)

One can ask for all the ancestors of John as follows.

retrieve* into answer (parent.older)

using a in answer

where parent.younger = “John”

or parent.younger = a.older

In this case, the * after retrieve indicates that the associated query should be run until the answer fails to grow.
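
The fixpoint behavior of retrieve* can be sketched as follows (a hypothetical Python illustration; the sample parent pairs and the `ancestors` routine are invented):

```python
# Hypothetical sketch of retrieve*: rerun the query, accumulating results,
# until the answer fails to grow.

parent = [("Mary", "John"), ("Ann", "Mary"), ("Eve", "Ann")]  # (older, younger)

def ancestors(person):
    answer = set()
    while True:
        found = {older for (older, younger) in parent
                 if younger == person or younger in answer}
        if found <= answer:        # answer failed to grow: fixpoint reached
            return answer
        answer |= found

result = ancestors("John")
```

Each pass corresponds to one run of the query against the growing answer table.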

If one wishes to find the names of all employees over 40, one would write

retrieve (E.name) using E in EMP

where E.age > 40

On the other hand, if one wanted the names of all salesmen or employees over 40, the notation is

retrieve (E.name) using E in EMP*

where E.age > 40

Here the * after the constructed type EMP indicates that the query should be run over EMP and all constructed types under EMP in the inheritance hierarchy. This use of * allows a user to easily run queries over a constructed type and all its descendants.

Lastly, POSTGRES supports the notion of time travel. This feature allows a user to run historical queries. For example to find the salary of Sam at time T one would query

retrieve (EMP.salary)

using EMP [T]

where EMP.name = “Sam”

POSTGRES will automatically find the version of Sam’s record valid at the correct time and get the appropriate salary.
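
One way to picture this (a hypothetical Python illustration; the validity-interval representation, timestamps, and salaries are invented and are not the actual POSTGRES storage format):

```python
# Hypothetical sketch of time travel: each record version carries a
# validity interval, and a query at time T picks the version valid then.

sam_versions = [
    {"salary": 40000, "valid_from": 0,  "valid_to": 10},
    {"salary": 50000, "valid_from": 10, "valid_to": None},  # current version
]

def salary_at(t):
    for v in sam_versions:
        if v["valid_from"] <= t and (v["valid_to"] is None or t < v["valid_to"]):
            return v["salary"]
```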

Like relational systems, the result of a POSTQUEL command can be added to the database as a new constructed type. In this case, POSTQUEL follows the lead of relational systems by removing duplicate records from the result. The user who is interested in retaining duplicates can do so by ensuring that the OID field of some instance is included in the target list being selected. For a full description of POSTQUEL the interested reader should consult [35].

II.D  Fast Path

There are three reasons why we chose to implement a fast path feature. First, a user who wishes to interact with a database by executing a sequence of functions to navigate to desired data can use fast path to accomplish his objective. Second, there are a variety of decision support applications in which the end user is given a specialized query language. In such environments, it is often easier for the application developer to construct a parse tree representation for a query rather than an ASCII one. Hence, it would be desirable for the application designer to be able to directly call the POSTGRES optimizer or executor. Most DBMS’s do not allow direct access to internal system modules.

The third reason is a bit more complex. In the persistent CLOS layer of Picasso, it is necessary for the run-time system to assign a unique identifier (OID) to every constructed object that is persistent. It is undesirable for the system to synchronously insert each object directly into a POSTGRES database and thereby assign a POSTGRES identifier to the object. This would result in poor performance in executing a persistent CLOS program. Rather, persistent CLOS maintains a cache of objects in the address space of the program and only inserts a persistent object into this cache synchronously. There are several options which control how the cache is written out to the database at a later time. Unfortunately, it is essential that a persistent object be assigned a unique identifier at the time it enters the cache, because other objects may have to point to the newly created object and use its OID to do so.

If persistent CLOS assigns unique identifiers, then there will be a complex mapping that must be performed when objects are written out to the database and real POSTGRES unique identifiers are assigned. Alternately, persistent CLOS must maintain its own system for unique identifiers, independent of the POSTGRES one, an obvious duplication of effort. The solution chosen was to allow persistent CLOS to access the POSTGRES routine that assigns unique identifiers and allow it to preassign N POSTGRES object identifiers which it can subsequently assign to cached objects. At a later time, these objects can be written to a POSTGRES database using the preassigned unique identifiers. When the supply of identifiers is exhausted, persistent CLOS can request another collection.
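
The chosen scheme can be sketched as follows (a hypothetical Python illustration; the class names and block size are invented):

```python
# Hypothetical sketch: the client cache preallocates a block of N OIDs
# from the DBMS's OID routine and assigns them locally, avoiding one
# synchronous database insert per newly created object.

import itertools

class OidServer:                    # stands in for the POSTGRES OID routine
    def __init__(self):
        self._counter = itertools.count(1)
    def preallocate(self, n):
        return [next(self._counter) for _ in range(n)]

class ObjectCache:                  # stands in for the persistent CLOS cache
    def __init__(self, server, block_size=2):
        self.server, self.block_size = server, block_size
        self.pool, self.objects = [], {}
    def insert(self, obj):
        if not self.pool:           # supply exhausted: request another block
            self.pool = self.server.preallocate(self.block_size)
        oid = self.pool.pop(0)
        self.objects[oid] = obj     # object has its OID before any DB write
        return oid

cache = ObjectCache(OidServer(), block_size=2)
oids = [cache.insert(o) for o in ("a", "b", "c")]
```

Other cached objects can point to a new object by this OID immediately, and the same OID is used when the cache is later written to the database.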

In all of these examples, an application program requires direct access to a user-defined or internal POSTGRES function, and therefore the POSTGRES query language has been extended with

function-name (param-list)

In this case, besides running queries in POSTQUEL, a user can ask that any function known to POSTGRES be executed. This function can be one that a user has previously defined as a normal, operator, or POSTQUEL function or it can be one that is included in the POSTGRES implementation.

Hence, the user can directly call the parser, the optimizer, the executor, the access methods, the buffer manager, or the utility routines. In addition, he can define functions which in turn make calls on POSTGRES internals. In this way, he can have considerable control over the low-level flow of control, much as is available through a DBMS toolkit such as Exodus [20], but without all the effort involved in configuring a tailored DBMS from the toolkit. Moreover, should the user wish to interact with his database by making a collection of function calls (method invocations), this facility allows the possibility. As noted in the Introduction, we do not expect this interface to be especially popular.

The above capability is called fast path because it provides direct access to specific functions without checking the validity of parameters. As such, it is effectively a remote procedure call facility and allows a user program to call a function in another address space rather than in its own address space.

II.E  Is POSTGRES Object-Oriented?

There have been many next-generation data models proposed in the last few years. Some are characterized by the term “extended relational,” others are considered “object-oriented,” while yet others are termed “nested relational.” POSTGRES could be accurately described as an object-oriented system because it includes unique identity for objects, abstract data types, classes (constructed types), methods (functions), and inheritance for both data and functions. Others (e.g., [2]) are suggesting definitions for the word “object-oriented,” and POSTGRES satisfies virtually all of the proposed litmus tests.

On the other hand, POSTGRES could also be considered an extended relational system. As noted in a previous footnote, Section II could have been equally well written with the word “constructed type” and “instance” replaced by the words “relation” and “tuple.” In fact, in previous descriptions of POSTGRES [26], this notation was employed. Hence, others, e.g., [18], have characterized POSTGRES as an extended relational system.

Lastly, POSTGRES supports the POSTQUEL type, which is exactly a nested relational structure. Consequently, POSTGRES could be classified as a nested relational system as well.

As a result, POSTGRES could be described using any of the three adjectives above. In our opinion, we can interchangeably use the words relations, classes, and constructed types in describing POSTGRES. Moreover, we can also interchangeably use the words function and method. Lastly, we can interchangeably use the words instance, record, and tuple. Hence, POSTGRES seems to be either object-oriented or not object-oriented, depending on the choice of a few tokens in the parser. As a result, we feel that most of the efforts to classify the extended data models in next-generation database systems are silly exercises in surface syntax.

In the remainder of this section, we comment briefly on the POSTGRES implementation of OID’s and inheritance. POSTGRES gives each record a unique identifier (OID), and then allows the application designer to decide for each constructed type whether he wishes to have an index on the OID field. This decision should be contrasted with most object-oriented systems which construct an OID index for all constructed types in the system automatically. The POSTGRES scheme allows the cost of the index to be paid only for those types of objects for which it is profitable. In our opinion, this flexibility has been an excellent decision.

Second, there are several possible ways to implement an inheritance hierarchy. Considering the SALESMEN and EMP example noted earlier, one can store instances of SALESMAN by storing them as EMP records and then only storing the extra quota information in a separate SALESMAN record. Alternately, one can store no information on each salesman in EMP and then store complete SALESMAN records elsewhere. Clearly, there are a variety of additional schemes.

POSTGRES chose one implementation, namely storing all SALESMAN fields in a single record. However, it is likely that applications designers will demand several other representations to give them the flexibility to optimize their particular data. Future implementations of inheritance will likely require several storage options.

II.F  A Critique of the POSTGRES Data Model

There are five areas where we feel we made mistakes in the POSTGRES data model:

union types

access method interface

functions

large objects

arrays.

We discuss each in turn.

A desirable feature in any next-generation DBMS would be to support union types, i.e., an instance of a type can be an instance of one of several given types. A persuasive example (similar to one from [10]) is that employees can be on loan to another plant or on loan to a customer. If two base types, customer and plant, exist, one would like to change the EMP type to

create EMP (name = c12, dept = DEPT, salary =

 float, on-loan-to = plant or customer)

Unfortunately, including union types makes a query optimizer more complex. For example, to find all the employees on loan to the same organization one would state the query

retrieve (EMP.name, E.name)

using E in EMP

where EMP.on-loan-to = E.on-loan-to

However, the optimizer must construct two different plans, one for employees on loan to a customer and one for employees on loan to a different plant. The reason for two plans is that the equality operator may be different for the two types. In addition, one must construct indexes on union fields, which entails substantial complexity in the access methods.

Union types are highly desirable in certain applications, and we considered three possible stances with respect to union types:

1)  support only through abstract data types

2)  support through POSTQUEL functions

3)  full support.

Union types can be easily constructed using the POSTGRES abstract data type facility. If a user wants a specific union type, he can construct it and then write appropriate operators and functions for the type. The implementation complexity of union types is thus forced into the routines for the operators and functions and onto the implementor of the type. Moreover, it is clear that there are a vast number of union types and an extensive type library must be constructed by the application designer. The Picasso team stated that this approach placed an unacceptably difficult burden on them, and therefore position 1) was rejected.

Position 2) offers some support for union types but has problems. Consider the example of employees and their hobbies from [26].

create EMP (name = c12, hobbies = POSTQUEL)

Here the hobbies field is a POSTQUEL function, one per employee, which retrieves all hobby information about that particular employee. Now consider the following POSTQUEL query:

retrieve (EMP.hobbies.average)

 where EMP.name = “Fred”

In this case, the field average for each hobby record will be returned whenever it is defined. Suppose, however, that average is a float for the softball hobby and an integer for the cricket hobby. In this case, the application program must be prepared to accept values of different types.

The more difficult problem is the following legal POSTQUEL query:

retrieve into TEMP (result = EMP.hobbies.average)

 where EMP.name = “Fred”

In this case, a problem arises concerning the type of the result field, because it is a union type. Hence, adopting position 2) leaves one in an awkward position of not having a reasonable type for the result of the above query.

Of course, position 3) requires extending the indexing and query optimization routines to deal with union types. Our solution was to adopt position 2) and to add an abstract data type, ANY, which can hold an instance of any type. This solution, which turns the type of the result of the above query from

one-of {integer, float}

into ANY, is not very satisfying. Not only is information lost, but we are also forced to include this universal type in POSTGRES.

In our opinion, the only realistic alternative is to adopt position 3), swallow the complexity increase, and that is what we would do in any next system.

Another failure concerned the access method design and was the decision to support indexing only on the value of a field and not on a function of a value. The utility of indexes on functions of values is discussed in [17], and the capability was retrofitted, rather inelegantly, into one version of POSTGRES [4].

Another comment on the access method design concerns extendibility. Because a user can add new base types dynamically, it is essential that he also be able to add new access methods to POSTGRES if the system does not come with an access method that supports efficient access to his types. The standard example of this capability is the use of R-trees [15] to speed access to geometric objects. We have now designed and/or coded three access methods for POSTGRES in addition to B+-trees. Our experience has consistently been that adding an access method is very hard. There are four problems that complicate the situation. First, the access method must include explicit calls to the POSTGRES locking subsystem to set and release locks on access method objects. Hence, the designer of a new access method must understand locking and how to use the particular POSTGRES facilities. Second, the designer must understand how to interface to the buffer manager and be able to get, put, pin, and unpin pages. Next, the POSTGRES execution engine contains the “state” of the execution of any query and the access methods must understand portions of this state and the data structures involved. Last, but not least, the designer must write 13 nontrivial routines. Our experience so far is that novice programmers can add new types to POSTGRES; however, it requires a highly skilled programmer to add a new access method. Put differently, the manual on how to add new data types to POSTGRES is two pages long; the one for access methods is 50 pages.

We failed to realize the difficulty of access method construction. Hence, we designed a system that allows end users to add access methods dynamically to a running system. However, access methods will be built by sophisticated system programmers who could have used a simpler interface.

A third area where our design is flawed concerns POSTGRES support for POSTQUEL functions. Currently, such functions in POSTGRES are collections of commands in the query language POSTQUEL. If one defined budget in DEPT as a POSTQUEL function, then the value for the shoe department budget might be the following command:

retrieve (DEPT.budget) where DEPT.dname =
 “candy”

In this case, the shoe department will automatically be assigned the same budget as the candy department. However, it is impossible for the budget of the shoe department to be specified as

if floor = 1 then
 retrieve (DEPT.budget) where DEPT.dname =
  “candy”
else
 retrieve (DEPT.budget) where DEPT.dname =
  “toy”

This specification defines the budget of the shoe department to be the same as the candy department budget if the shoe department is on the first floor; otherwise, it is the same as the toy department budget. This query is not possible because POSTQUEL has no conditional expressions. We had extensive discussions about this and other extensions to POSTQUEL. Each such extension was rejected because it seemed to turn POSTQUEL into a programming language and not a query language.

A better solution would be to allow a POSTQUEL function to be expressible in a general purpose programming language enhanced with POSTQUEL queries. Hence, there would be no distinction between normal functions and POSTQUEL functions. Put differently, normal functions would be able to return constructed types and would support path expressions.

There are three problems with this approach. First, path expressions for normal functions cannot be optimized by the POSTGRES query optimizer because they have arbitrary semantics. Hence, most of the optimizations planned for POSTQUEL functions would have to be discarded. Second, POSTQUEL functions are much easier to define than normal functions because a user need not know a general purpose programming language. Also, he need not specify the types of the function arguments or the return type because POSTGRES can figure these out from the query specification. Hence, we would have to give up ease of definition in moving from POSTQUEL functions to normal functions. Lastly, normal functions have a protection problem because they can do arbitrary things, such as zeroing the database. POSTGRES deals with this problem by calling normal functions in two ways:

trusted—loaded into the POSTGRES address space

untrusted—loaded into a separate address space.

Hence, normal functions are either called quickly with no security or slowly in a protected fashion. No such security problem arises with POSTQUEL functions.

A better approach might have been to support POSTQUEL functions written in the fourth generation language (4GL) being designed for Picasso [22]. This programming system leaves type information in the system catalogs. Consequently, there would be no need for a separate registration step to indicate type information to POSTGRES. Moreover, a processor for the language is available for integration in POSTGRES. It is also easy to make a 4GL “safe,” i.e., unable to perform wild branches or malicious actions. Hence, there would be no security problem. Also, it seems possible that path expressions could be optimized for 4GL functions.

Current commercial relational products seem to be moving in this direction by allowing database procedures to be coded in their proprietary fourth generation languages (4GL’s). In retrospect, we probably should have looked seriously at designing POSTGRES to support functions written in a 4GL.

Next, POSTGRES allows types to be constructed that are of arbitrary size. Hence, large bitmaps are a perfectly acceptable POSTGRES data type. However, the current POSTGRES user interface (portals) allows a user to fetch one or more instances of a constructed type. It is currently impossible to fetch only a portion of an instance. This presents an application program with a severe buffering problem; it must be capable of accepting an entire instance, no matter how large it is. We should extend the portal syntax in a straightforward way to allow an application to position a portal on a specific field of an instance of a constructed type and then specify a byte count that he would like to retrieve. These changes would make it much easier to insert and retrieve big fields.

Lastly, we included arrays in the POSTGRES data model. Hence, one could have specified the SALESMAN type as

create SALESMAN (name = c12, dept = DEPT,
 salary = float, quota = float[12])

Here, the SALESMAN has all the fields of EMP plus a quota which is an array of 12 floats, one for each month of the year. In fact, character strings are really an array of characters, and the correct notation for the above type is

create SALESMAN (name = c[12], dept = DEPT,
 salary = float, quota = float[12])

In POSTGRES, we support fixed and variable length arrays of base types, along with an array notation in POSTQUEL. For example, to request all salesmen who have an April quota over 1000, one would write

retrieve (SALESMAN.name) where
 SALESMAN.quota[4] > 1000

However, we do not support arrays of constructed types; hence, it is not possible to have an array of instances of a constructed type. We omitted this capability only because it would have made the query optimizer and executor somewhat harder. In addition, there is no built-in search mechanism for the elements of an array. For example, it is not possible to find the names of all salesmen who have a quota over 1000 during any month of the year. In retrospect, we should have included general support for arrays or no support at all.
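
The missing array search can be illustrated with plain Python over rows shaped like the SALESMAN example (data values are invented). POSTQUEL could express a fixed subscript such as `quota[4] > 1000`, but not a quantified search over all elements:

```python
# Rows shaped like the SALESMAN type; quota is an array of 12 monthly floats.
salesmen = [
    {"name": "Smith", "quota": [900.0] * 12},
    {"name": "Jones", "quota": [800.0] * 3 + [1200.0] + [800.0] * 8},  # April high
]

# Fixed subscript, expressible in POSTQUEL: quota[4] > 1000 (April, 1-indexed).
april_over = [s["name"] for s in salesmen if s["quota"][3] > 1000]

# Quantified search, NOT expressible: quota over 1000 in *any* month.
over_in_any_month = [s["name"] for s in salesmen
                     if any(q > 1000 for q in s["quota"])]
```

Here the two lists happen to coincide; the point is that only the first query had a POSTQUEL counterpart.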

III  The Rules System

III.A  Introduction

It is clear to us that all DBMS’s need a rules system. Current commercial systems are required to support referential integrity [12], which is merely a simple-minded collection of rules. In addition, most current systems have special purpose rules systems to support relational views, protection, and integrity constraints. Lastly, a rules system allows users to do event-driven programming as well as enforce integrity constraints that cannot be performed in other ways. There are three high-level decisions that the POSTGRES team had to make concerning the philosophy of rule systems.

First, a decision was required concerning how many rule syntaxes there would be. Some approaches, e.g., [13], [36], propose rule systems oriented toward application designers that would augment other rule systems present for DBMS internal purposes. Hence, such systems would contain several independently functioning rules systems. On the other hand, [25] proposed a rule system that tried to support user functionality as well as needed DBMS internal functions in a single syntax.

From the beginning, a goal of the POSTGRES rules system was to have only one syntax. It was felt that this would simplify the user interface, since application designers need learn only one construct. Also, they would not have to deal with deciding which system to use in the cases where a function could be performed by more than one rules system. It was also felt that a single rules system would ease the implementation difficulties that would be faced.

Second, there are two implementation philosophies by which one could support a rule system. The first is a query rewrite implementation. Here, a rule would be applied by converting a user query to an alternate form prior to execution. This transformation is performed between the query language parser and the optimizer. Support for views [24] is done this way along with many of the proposals for recursive query support [5], [33]. Such an implementation will be very efficient when there are a small number of rules on any given constructed type and most rules cover the whole constructed type. For example, a rule such as

EMP[dept] contained-in DEPT[dname]

expresses the referential integrity condition that employees cannot be in a nonexistent department and applies to all EMP instances. However, a query rewrite implementation will not work well if there are a large number of rules on each constructed type, each of them covering only a few instances. Consider, for example, the following three rules:

employees in the shoe department have a steel desk

employees over 40 have a wood desk

employees in the candy department do not have a desk.

To retrieve the kind of desk that Sam has, one must run the following three queries:

retrieve (desk = “steel”) where EMP.name = “Sam”
 and EMP.dept = “shoe”

retrieve (desk = “wood”) where EMP.name = “Sam”
 and EMP.age > 40

retrieve (desk = null) where EMP.name = “Sam”
 and EMP.dept = “candy”

Hence, a user query must be rewritten for each rule, resulting in a serious degradation of performance unless all queries are processed as a group using multiple query optimization techniques [23].
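
A sketch of this blow-up, with the three desk rules encoded as Python predicates (an illustration of the rewrite scheme, not the PRS implementation): each rule turns the single user query into one rewritten query whose qualification conjoins the user's predicate with the rule's.

```python
# Three narrow rules on the (virtual) desk attribute of EMP.
rules = [
    {"desk": "steel", "cond": lambda e: e["dept"] == "shoe"},
    {"desk": "wood",  "cond": lambda e: e["age"] > 40},
    {"desk": None,    "cond": lambda e: e["dept"] == "candy"},
]

def rewrite(user_pred):
    # One rewritten query per rule: user qualification AND rule qualification.
    return [lambda e, r=r: user_pred(e) and r["cond"](e) for r in rules]

emp = {"name": "Sam", "dept": "shoe", "age": 35}
queries = rewrite(lambda e: e["name"] == "Sam")
desks = [rules[i]["desk"] for i, q in enumerate(queries) if q(emp)]
# The number of rewritten queries grows linearly with the number of rules.
```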

Moreover, a query rewrite system has great difficulty with exceptions [8]. For example, consider the rule “all employees have a steel desk” together with the exception “Jones is an employee who has a wood desk.” If one asks for the kind of desk and age for all employees over 35, then the query must be rewritten as the following two queries:

retrieve (desk = “steel”, EMP.age) where EMP.age
 > 35 and EMP.name != “Jones”

retrieve (desk = “wood”, EMP.age) where EMP.age
 > 35 and EMP.name = “Jones”

In general, the number of queries as well as the complexity of their qualifications increases linearly with the number of rules. Again, this will result in bad performance unless multiple query optimization techniques are applied.

Lastly, a query rewrite system does not offer any help in resolving situations when the rules are violated. For example, the above referential integrity rule is silent on what to do if a user tries to insert an employee into a nonexistent department.

On the other hand, one could adopt a trigger implementation based on individual record accesses and updates to the database. Whenever a record is accessed, inserted, deleted, or modified, the low-level execution code has both the old record and the new record readily available. Hence, assorted actions can easily be taken by the low-level code. Such an implementation requires the rule firing code to be placed deep in the query execution routines. It will work well if there are many rules each affecting only a few instances, and it is easy to deal successfully with conflict resolution at this level. However, rule firing is deep in the executor, and it is thereby impossible for the query optimizer to construct an efficient execution plan for a chain of rules that are awakened.
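
The trigger style can be sketched in a few lines of Python (names like `on_update` and `update_record` are invented for illustration): the low-level update path has both the old and new records in hand and fires the registered rules directly.

```python
# Record-level trigger sketch: rules fire inside the update path, where the
# executor holds both the old and the new record.
triggers = []

def on_update(fn):
    triggers.append(fn)
    return fn

def update_record(table, key, changes):
    old = dict(table[key])       # snapshot the old record
    table[key].update(changes)   # apply the update
    new = table[key]
    for t in triggers:           # fire low-level triggers with both records
        t(old, new)

emp = {"Fred": {"name": "Fred", "salary": 100},
       "Joe":  {"name": "Joe",  "salary": 50}}

@on_update
def joe_tracks_fred(old, new):
    # Early evaluation of "Joe makes the same salary as Fred":
    # propagate any change to Fred's salary on to Joe.
    if new["name"] == "Fred" and old["salary"] != new["salary"]:
        emp["Joe"]["salary"] = new["salary"]

update_record(emp, "Fred", {"salary": 200})
```

Note that the query optimizer never sees `joe_tracks_fred`; the firing happens below the level at which plans are constructed, which is exactly the weakness the text describes.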

Hence, this implementation complements a query rewrite scheme in that it excels where a rewrite scheme is weak and vice-versa. Since we wanted to have a single rule system, it was clear that we needed to provide both styles of implementation.

A third issue that we faced was the paradigm for the rules system. A conventional production system consisting of collections of if-then rules has been explored in the past [13], [25] and is a readily available alternative. However, such a scheme lacks expressive power. For example, suppose one wants to enforce a rule that Joe makes the same salary as Fred. In this case, one must specify two different if-then rules. The first one indicates the action to take if Fred receives a raise, namely to propagate the change on to Joe. The second rule specifies that any update to Joe’s salary must be refused. Hence, many user rules require two or more if-then specifications to achieve the desired effect.

The intent in POSTGRES was to explore a more powerful paradigm. Basically, any POSTGRES command can be turned into a rule by changing the semantics of the command so that it is logically either always running or never running. For example, Joe may be specified to have the same salary as Fred by the rule

always replace EMP (salary = E.salary)
using E in EMP
where EMP.name = “Fred” and E.name = “Joe”

This single specification will propagate Joe’s salary on to Fred as well as refuse direct updates to Fred’s salary. In this way, a single “always” rule replaces the two statements needed in a production rule syntax.

Moreover, to efficiently support the triggering implementation where there are a large number of rules present for a single constructed type, each of which applies to only a few instances, the POSTGRES team designed a sophisticated marking scheme whereby rule wakeup information is placed on individual instances. Consequently, regardless of the number of rules present for a single constructed type, only those which actually must fire will be awakened. This should be contrasted to proposals without such data structures, which will be hopelessly inefficient whenever a large number of rules are present for a single constructed type.
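
The effect of instance-level markers can be sketched as follows (a toy illustration, not the PRS data structures): wakeup information sits on the instances a rule actually covers, so touching an instance wakes only the rules marked there, no matter how many rules exist on the constructed type.

```python
# Per-instance rule markers: each instance carries the ids of the rules
# that cover it; an access wakes only those rules.
instances = [
    {"name": "Sam", "dept": "shoe", "markers": {7}},    # rule 7 covers Sam
    {"name": "Ann", "dept": "toy",  "markers": set()},  # no rules cover Ann
]
fired = []

def touch(inst):
    # Only rules marked on this instance are awakened, regardless of how
    # many rules exist for the constructed type as a whole.
    for rule_id in sorted(inst["markers"]):
        fired.append(rule_id)

for i in instances:
    touch(i)
```

A scheme without such markers would have to test every rule on the type against every touched instance.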

Lastly, the decision was made to support the query rewrite scheme by escalating markers to the constructed type level. For example, consider the rule

always replace EMP (age = 40) where name !=
 “Bill”

This rule applies to all employees except Bill and it would be a waste of space to mark each individual employee. Rather, one would prefer to set a single marker in the system catalogs to cover the whole constructed type implicitly. In this case, any query, e.g.,

retrieve (EMP.age) where EMP.name = “Sam”

will be altered prior to execution by the query rewrite implementation to

retrieve (age = 40) where EMP.name = “Sam” and
 EMP.name != “Bill”

At the current time, much of the POSTGRES rules system (PRS) as described in [30] is operational, and there are three aspects of the design which we wish to discuss in the next three subsections, namely,

complexity

absence of needed function

efficiency.

Then, we close with the second version of the POSTGRES rules system (PRS II) which we are currently designing. This rules system is described in more detail in [31], [32].

III.B  Complexity

The first problem with PRS is that the implementation is exceedingly complex. It is difficult to explain the marking mechanisms that cause rule wakeup even to a sophisticated person. Moreover, some of us have an uneasy feeling that the implementation may not be quite correct. The fundamental problem can be illustrated using the Joe–Fred example above. First, the rule must be awakened and run whenever Fred’s salary changes. This requires that one kind of marker be placed on the salary of Fred. However, if Fred is given a new name, say Bill, then the rule must be deleted and reinstalled. This requires a second kind of marker on the name of Fred. Additionally, it is inappropriate to allow any update to Joe’s salary; hence, a third kind of marker is required on that field. Furthermore, if Fred has not yet been hired, then the rule must take effect on the insertion of his record. This requires a marker to be placed in the index for employee names. To support rules that deal with ranges of values, for example,

always replace EMP (age = 40)
where EMP.salary > 50000 and EMP.salary < 60000

we require that two “stub” markers be placed in the index to denote the ends of the scan. In addition, each intervening index record must also be marked. Ensuring that all markers are correctly installed and appropriate actions taken when record accesses and updates occur has been a challenge.

Another source of substantial complexity is the necessity to deal with priorities. For example, consider a second rule:

always replace EMP (age = 50) where EMP.dept =
 “shoe”

In this case, a highly paid shoe department employee would be given two different ages. To alleviate this situation, the second rule could be given a higher priority, e.g.,

always replace EMP (age = 50) where EMP.dept =
 “shoe”
priority = 1

The default priority for rules is 0; hence, the first rule would set the age of highly paid employees to 40 unless they were in the shoe department, in which case their age would be set to 50 by the second rule. Priorities, of course, add complications to the rules system. For example, if the second rule above is deleted, then the first rule must be awakened to correct the ages of employees in the shoe department.
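
The conflict-resolution step can be sketched in Python (an illustration of the priority semantics described above, not the PRS implementation): among the rules whose qualifications match an employee, the highest-priority rule supplies the value.

```python
# Two "always" rules on EMP.age, the second with a higher priority.
rules = [
    {"pred": lambda e: e["salary"] > 50000, "age": 40, "priority": 0},
    {"pred": lambda e: e["dept"] == "shoe", "age": 50, "priority": 1},
]

def effective_age(emp):
    # Conflict resolution: the matching rule with the highest priority wins.
    matching = [r for r in rules if r["pred"](emp)]
    if not matching:
        return emp["age"]
    return max(matching, key=lambda r: r["priority"])["age"]

rich_shoe = {"dept": "shoe", "salary": 60000, "age": 33}   # both rules match
rich_toy  = {"dept": "toy",  "salary": 60000, "age": 33}   # only the first
```

Deleting the priority-1 rule changes the answer for `rich_shoe`, which is why rule deletion must re-awaken the surviving rules.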

Another aspect of complexity is our decision to support both early and late evaluation of rules. Consider the example rule that Joe makes the same salary as Fred. This rule can be awakened when Fred gets a salary adjustment, or activation can be delayed until a user requests the salary of Joe. Activation can be delayed as long as possible in the second case, and we term this late evaluation while the former case is termed early evaluation. This flexibility also results in substantial extra complexity. For example, certain rules cannot be activated late. If salaries of employees are indexed, then the rule that sets Joe’s salary to that of Fred must be activated early because the index must be kept correct. Moreover, it is impossible for an early rule to read data that are written by a late rule. Hence, additional restrictions must be imposed.

Getting PRS correct has entailed uncounted hours of discussion and considerable implementation complexity. The bottom line is that the implementation of a rule system that is clean and simple to the user is, in fact, extremely complex and tricky. Our personal feeling is that we should have embarked on a more modest rules system.

III.C  Absence of Needed Function

The definition of a useful rules system is one that can handle at least all of the following problems in one integrated system:

support for views

protection

referential integrity

other integrity constraints.

We focus in this section on support for views. The query rewrite implementation of a rules system should be able to translate queries on views into queries on real objects. In addition, updates to views should be similarly mapped to updates on real objects.

There are various special cases of view support that can be performed by PRS, for example materialized views. Consider the following view definition:

define view SHOE-EMP (name = EMP.name, age =
 EMP.age, salary = EMP.salary)
where EMP.dept = “shoe”

The following two PRS rules specify a materialization of this view:

always append to SHOE-EMP (name = EMP.name,
 salary = EMP.salary) where EMP.dept = “shoe”

always delete SHOE-EMP where SHOE-EMP.name
 NOTIN {EMP.name where EMP.dept = “shoe”}

In this case, SHOE-EMP will always contain a correct materialization of the shoe department employees, and queries can be directed to this materialization.

However, there seemed to be no way to support updates on views that are not materialized. One of us has spent countless hours attempting to support this function through PRS and failed. Hence, inability to support operations provided by conventional views is a major weakness of PRS.

III.D  Implementation Efficiency

The current POSTGRES implementation uses markers on individual fields to support rule activation. The only escalation supported is to convert a collection of field level markers to a single marker on the entire constructed type. Consequently, if a rule covers a single instance, e.g.,

always replace EMP (salary = 1000) where EMP.name =
 “Sam”

then a total of three markers will be set, one in the index, one on the salary field, and one on the name field. Each marker is composed of

rule-id      6 bytes
priority     1 byte
marker-type  1 byte.

Consequently, the marker overhead for the rule is 24 bytes. Now consider a more complex rule:

always replace EMP (salary = 1000) where EMP.dept =
 “shoe”

If 1000 employees work in the shoe department, then 24K bytes of overhead will be consumed in markers. The only other option is to escalate to a marker on the entire constructed type, in which case the rule will be activated if any salary is read or written and not just for employees in the shoe department. This will be an overhead intensive option. Hence, for rules which cover many instances but not a significant fraction of all instances, the POSTGRES implementation will not be very space efficient.
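
The arithmetic behind these figures, spelled out (each marker is the 6-byte rule-id plus 1-byte priority plus 1-byte marker-type described above, and the single-instance rule sets three markers: one in the index, one on the salary field, one on the name field):

```python
# Per-marker size: rule-id (6) + priority (1) + marker-type (1).
MARKER_BYTES = 6 + 1 + 1

# Single-instance "Sam" rule: index marker + salary marker + name marker.
single_rule_overhead = 3 * MARKER_BYTES           # 24 bytes

# "shoe" rule covering 1000 employees, same three markers per employee.
shoe_rule_overhead = 1000 * 3 * MARKER_BYTES      # 24K bytes
```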

We are considering several solutions to this problem. First, we have generalized B+-trees to efficiently store interval data as well as point data. Such “segmented B+-trees” are the subject of a separate paper [16]. This will remove the space overhead in the index for the dominant form of access method. Second, to lower the overhead on data records, we will probably implement markers at the physical block level as well as at the instance and constructed type levels. The appropriate extra granularities are currently under investigation.

III.E  The Second POSTGRES Rules System

Because of the inability of the current rules paradigm to support views and to a lesser extent the fundamental complexity of the implementation, we are converting to a second POSTGRES rules system (PRS II). This rules system has much in common with the first implementation, but returns to the traditional production rule paradigm to obtain sufficient control to perform view updates correctly. This section outlines our thinking, and a complete proposal appears in [32].

The production rule syntax we are using in PRS II has the form

ON event TO object
 WHERE POSTQUEL-qualification
THEN DO POSTQUEL-command(s)

Here, event is RETRIEVE, REPLACE, DELETE, APPEND, UPDATE, NEW (i.e., replace or append), or OLD (i.e., delete or replace). Moreover, object is either the name of a constructed type or constructed-type.column. POSTQUEL-qualification is a normal qualification, with no additions or changes. Lastly, POSTQUEL-commands is a set of POSTQUEL commands with the following two changes:

NEW, OLD, or CURRENT can appear instead of the
 name of a constructed type in front of any attribute

refuse (target-list) is added as a new POSTQUEL command

In this notation, we would specify the “Fred-Joe” rule as

on NEW EMP.salary where EMP.name = “Fred”
then do
 replace E (salary = CURRENT.salary)
 using E in EMP
 where E.name = “Joe”

on NEW EMP.salary where EMP.name = “Joe”
then do
 refuse

Notice that PRS II is less powerful than the “always” system because the Fred–Joe rule requires two specifications instead of one.

PRS II has both a query rewrite implementation and a trigger implementation, and it is an optimization decision which one to use as noted in [32]. For example, consider the rule

on RETRIEVE to SHOE-EMP
then do
retrieve (EMP.name, EMP.age, EMP.salary)
 where EMP.dept = “shoe”

Any query utilizing such a rule, e.g.,

retrieve (SHOE-EMP.name) where SHOE-EMP.age
 < 40

would be processed by the rewrite implementation to

retrieve (EMP.name) where EMP.age < 40 and
 EMP.dept = “shoe”

As can be seen, this is identical to the query modification performed in relational view processing techniques [24]. This rule could also be processed by the triggering system, in which case the rule would materialize the records in SHOE-EMP iteratively.
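
The query-modification step can be sketched in Python (an illustration of the rewrite, not the PRS II code): the qualification from the view's defining rule is conjoined with the user's qualification, and the retrieval runs against the base EMP data.

```python
# The SHOE-EMP view: a retrieval over EMP restricted to the shoe department.
view_def = {"base": "EMP", "qual": lambda e: e["dept"] == "shoe"}

def rewrite_view_query(user_qual):
    # retrieve on SHOE-EMP becomes retrieve on EMP with the merged qualification,
    # exactly as in relational view processing.
    return lambda e: user_qual(e) and view_def["qual"](e)

emps = [
    {"name": "Al", "dept": "shoe", "age": 30},
    {"name": "Bo", "dept": "toy",  "age": 30},
    {"name": "Cy", "dept": "shoe", "age": 45},
]
q = rewrite_view_query(lambda e: e["age"] < 40)
names = [e["name"] for e in emps if q(e)]
```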

Moreover, it is straightforward to support additional functionality, such as allowing multiple queries in the definition of a view. Supporting materialized views can be efficiently done by caching the action part of the above rule, i.e., executing the command before a user requests evaluation. This corresponds to moving the rule to early evaluation. Lastly, supporting views that are partly materialized and partly specified as procedures as well as views that involve recursion appears fairly simple. In [32], we present details on these extensions.

Consider the following collection of rules that support updates to SHOE-EMP:

on NEW SHOE-EMP

then do

 append to EMP (name = NEW.name, salary =

  NEW.salary)

on OLD SHOE-EMP

then do

 delete EMP where EMP.name = OLD.name and

 EMP.salary = OLD.salary

on update to SHOE-EMP

then do

 replace EMP (name = NEW.name, salary =

  NEW.salary)

 where EMP.name = NEW.name

If these rules are processed by the trigger implementation, then an update to SHOE-EMP, e.g.,

replace SHOE-EMP (salary = 1000) where SHOE-

 EMP.name = “Mike”

will be processed normally until it generates a collection of

[new-record, old-record]

pairs. At this point the triggering system can be activated to make appropriate updates to underlying constructed types. Moreover, if a user wishes nonstandard view update semantics, he can perform any particular actions he desires by changing the action part of the above rules.
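A hypothetical sketch of this trigger path, with all names invented: the replace on SHOE-EMP is processed just far enough to produce [new-record, old-record] pairs, and the action part of the "on update" rule is then fired once per pair against the underlying EMP records:

```python
# Sketch (invented structures) of trigger-based view update: the update
# to the view generates [new-record, old-record] pairs, each of which
# activates the rule action that updates the underlying EMP type.

EMP = [
    {"name": "Mike", "salary": 800, "dept": "shoe"},
    {"name": "Sam",  "salary": 900, "dept": "toy"},
]

def on_update_shoe_emp(new, old):
    """Action part: replace EMP (salary = NEW.salary) where EMP.name = NEW.name."""
    for row in EMP:
        if row["name"] == new["name"]:
            row["salary"] = new["salary"]

def replace_shoe_emp(assignments, where):
    # Run the view update normally until it yields [new, old] pairs...
    pairs = []
    for row in EMP:
        if row["dept"] == "shoe" and where(row):
            old = dict(row)
            pairs.append(({**old, **assignments}, old))
    # ...then let the triggering system apply the rule action per pair.
    for new, old in pairs:
        on_update_shoe_emp(new, old)

# replace SHOE-EMP (salary = 1000) where SHOE-EMP.name = "Mike"
replace_shoe_emp({"salary": 1000}, lambda r: r["name"] == "Mike")
```

A user wanting nonstandard view update semantics would simply substitute a different body for `on_update_shoe_emp`.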

PRS II thereby allows a user to use the rules system to define semantics for retrievals and updates to views. In fact, we expect to build a compiler that will convert a higher level view notation into the needed collection of PRS II rules. In addition, PRS II retains all the functionality of the first rules system, so protection, alerters, integrity constraints, and arbitrary triggers are readily expressed. The only disadvantage is that PRS II requires two rules to perform many tasks expressible as a single PRS rule. To overcome this disadvantage, we will likely continue to support the PRS syntax in addition to the PRS II syntax and compile PRS into PRS II support.

PRS II can be supported by the same implementation that we proposed for the query rewrite implementation of PRS, namely marking instances in the system catalogs. Moreover, the query rewrite algorithm is nearly the same as in the first implementation. The triggering system can be supported by the same instance markers as in PRS. In fact, the implementation is a bit simpler because a couple of the types of markers are not required. Because the implementation of PRS II is so similar to our initial rules system, we expect to have the conversion completed in the near future.

IV  Storage System

IV.A  Introduction

When considering the POSTGRES storage system, we were guided by a missionary zeal to do something different. All current commercial systems use a storage manager with a write-ahead log (WAL), and we felt that this technology was well understood. Moreover, the original INGRES prototype from the 1970’s used a similar storage manager, and we had no desire to do another implementation.

Hence, we seized on the idea of implementing a “no-overwrite” storage manager. Using this technique, the old record remains in the database whenever an update occurs, and serves the purpose normally performed by a write-ahead log. Consequently, POSTGRES has no log in the conventional sense of the term. Instead the POSTGRES log is simply 2 bits per transaction indicating whether each transaction committed, aborted, or is in progress.

Two very nice features can be exploited in a no-overwrite system. First, aborting a transaction can be instantaneous because one does not need to process the log undoing the effects of updates; the previous records are readily available in the database. More generally, to recover from a crash, one must abort all the transactions in progress at the time of the crash. This process can be effectively instantaneous in POSTGRES.

The second benefit of a no-overwrite storage manager is the possibility of time travel. As noted earlier, a user can ask a historical query and POSTGRES will automatically return information from the record valid at the correct time.
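The interplay of the 2-bit transaction log, instantaneous abort, and time travel can be sketched as follows. This is a toy model, not the actual POSTGRES record format: versions carry a validity interval and the ids of the transactions that wrote or closed them, and visibility is decided by consulting the per-transaction status:

```python
# Toy no-overwrite store (invented layout): an update appends a new
# version rather than overwriting, the "log" is just a status per
# transaction, and abort is merely flipping that status.

COMMITTED, ABORTED, IN_PROGRESS = "C", "A", "P"  # the 2 bits per transaction
txn_log = {}     # transaction id -> status; this is the entire "log"
versions = []    # no-overwrite heap: every update appends, nothing is erased

def is_open(v):
    # A version stays current unless a *committed* transaction closed it.
    return v["closed_by"] is None or txn_log[v["closed_by"]] == ABORTED

def update(xact, name, salary, now):
    txn_log.setdefault(xact, IN_PROGRESS)
    for v in versions:
        if v["name"] == name and txn_log[v["xact"]] == COMMITTED and is_open(v):
            v["closed_by"], v["tmax"] = xact, now   # close, never overwrite
    versions.append({"name": name, "salary": salary, "xact": xact,
                     "tmin": now, "tmax": None, "closed_by": None})

def as_of(name, t):
    """Time travel: the committed value of `name` valid at time t."""
    for v in versions:
        if v["name"] != name or txn_log[v["xact"]] != COMMITTED:
            continue
        if v["tmin"] <= t and (is_open(v) or t < v["tmax"]):
            return v["salary"]

update(1, "Sam", 5000, now=10); txn_log[1] = COMMITTED
update(2, "Sam", 6000, now=20); txn_log[2] = COMMITTED
update(3, "Sam", 9999, now=30); txn_log[3] = ABORTED  # abort = flip the bits
```

Aborting transaction 3 undid nothing: its version simply never becomes visible, and the version it "closed" remains reachable, which is why abort and crash recovery can be effectively instantaneous.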

This storage manager should be contrasted with a conventional one where the previous record is overwritten with a new one. In this case, a write-ahead log is required to maintain the previous version of each record. There is no possibility of time travel because the log is in a different format and therefore cannot be queried. Moreover, when a crash occurs, the database must be restored to a consistent state by processing the log to undo any partially completed transactions. Hence, there is no possibility of instantaneous crash recovery.

Clearly a no-overwrite storage manager is superior to a conventional one if it can be implemented at comparable performance. There is a brief hand-wave argument in [28] that alleges this might be the case. In our opinion, the argument hinges on the existence of stable main memory. In the absence of stable memory, a no-overwrite storage manager must force to disk at commit time all pages written by a transaction. This is required because the effects of a committed transaction must be durable in case a crash occurs and main memory is lost. A conventional data manager, on the other hand, need only force to disk at commit time the log pages for the transaction’s updates. Even if there are as many log pages as data pages (a highly unlikely occurrence), the conventional storage manager is doing sequential I/O to the log while a no-overwrite storage manager is doing random I/O. Since sequential I/O is substantially faster than random I/O, the no-overwrite solution is guaranteed to offer worse performance.

However, if stable main memory is present, then neither solution must force pages to disk. In this environment, performance should be comparable. Hence, with stable main memory it appears that a no-overwrite solution is competitive. As computer manufacturers offer some form of stable main memory, a no-overwrite solution may become a viable storage option.

In designing the POSTGRES storage system, we were guided by two philosophical premises. First, we decided to make a clear distinction between current data and historical data. We expected access patterns to be highly skewed toward current records. In addition, queries to the archive might look very different from those accessing current data. For both reasons, POSTGRES maintains two different physical collections of records, one for the current data and one for historical data, each with its own indexes.

Second, our design assumes the existence of a randomly addressable archive device on which historical records are placed. Our intuitive model for this archive is an optical disk. Our design was purposely made consistent with an archive that has a write-once-read-many (WORM) orientation. This characterizes many of the optical disks on the market today.

In the next subsection, we indicate two problems with the POSTGRES design. Then, in Section 1.5.3 we make additional comments on the storage manager.

IV.B  Problems in the POSTGRES Design

There are at least two problems with our design. First, it is unstable under heavy load. An asynchronous demon, known as the vacuum cleaner, is responsible for moving historical records from the magnetic disk structure holding the current records to the archive where historical records remain. Under normal circumstances, the magnetic disk portion of each constructed type is (say) only 1.1 times the minimum possible size of the constructed type. Of course, the vacuum cleaner consumes CPU and I/O resources running in background to achieve this goal. However, if the load on a POSTGRES database increases, then the vacuum cleaner may not get to run. In this case, the magnetic disk portion of a constructed type will increase, and performance will suffer because the execution engine must read historical records on the magnetic disk during the (presumably frequent) processing of queries to the current database. As a result, performance will degrade proportionally to the excess size of the magnetic disk portion of the database. As load increases, the vacuum cleaner gets fewer resources, and performance degrades as the size of the magnetic disk database increases. This will ultimately result in a POSTGRES database going into meltdown.
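The demon's job can be sketched as follows (data structures invented for illustration): closed record versions are swept from the magnetic-disk portion of a constructed type to the archive, and the per-run `budget` stands in for whatever CPU and I/O share the scheduler grants the demon:

```python
# Illustrative sketch of the vacuum cleaner demon: versions whose
# validity interval has closed are swept from the "magnetic disk"
# portion of a constructed type to the archive, keeping the disk
# portion close to the minimum set of current records.

disk = [   # current records plus not-yet-vacuumed historical versions
    {"name": "Sam", "salary": 5000, "tmax": 20},    # historical (closed)
    {"name": "Sam", "salary": 6000, "tmax": None},  # current
    {"name": "Joe", "salary": 4000, "tmax": None},  # current
]
archive = []   # WORM-style archive: append-only

def vacuum(budget):
    """Move up to `budget` closed versions per run; the budget stands in
    for the resource share the demon is scheduled with under load."""
    moved = 0
    for v in list(disk):
        if moved == budget:
            break
        if v["tmax"] is not None:        # closed interval => historical
            disk.remove(v)
            archive.append(v)
            moved += 1
    return moved

vacuum(budget=10)
```

The instability described above corresponds to the budget shrinking toward zero under heavy load, so that closed versions accumulate on the disk portion.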

Obviously, the vacuum cleaner should be run in background if possible so that it can consume resources at 2:00 A.M. when there is little other activity. However, if there is consistent heavy load on a system, then the vacuum cleaner must be scheduled at the same priority as other tasks, so the above instability does not occur. The bottom line is that scheduling the vacuum cleaner is a tricky optimization problem.

The second comment which we wish to make is that future archive systems are likely to be read/write, and rewritable optical disks have already appeared on the market. Consequently, there is no reason for us to have restricted ourselves to WORM technology. Certain POSTGRES assumptions were therefore unnecessary, such as requiring the current portion of any constructed type to be on magnetic disk.

IV.C  Other Comments

Historical indexes will usually be on a combined key consisting of a time range together with one or more keys from the record itself. Such two-dimensional indexes can be stored using the technology of R-trees [15], R+-trees [14], or perhaps in some new way. We are not particularly comfortable that good ways to index time ranges have been found, and we encourage additional work in this area. A possible approach is segmented R-trees which we are studying [16].
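As a stand-in for such an index (deliberately not an R-tree), one can picture entries that pair a record key with a [lo, hi) time range; a real implementation would organize both dimensions in an R-tree rather than scanning:

```python
# Toy combined time-range/key index (a linear scan, NOT an R-tree):
# each entry carries a key, a [lo, hi) validity interval (hi=None means
# still valid), and a payload identifying the record version.

entries = [
    ("Sam", 10, 20,   "salary=5000"),
    ("Sam", 20, None, "salary=6000"),
    ("Joe", 5,  None, "salary=4000"),
]

def lookup(key, t):
    """All versions of `key` whose time range covers instant t."""
    return [payload for k, lo, hi, payload in entries
            if k == key and lo <= t and (hi is None or t < hi)]
```

An R-tree would cluster entries by both key and interval so that a lookup touches only the overlapping regions instead of every entry.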

Another comment concerns POSTGRES support for time travel. There are many tasks that are very difficult to express with our mechanisms. For example, the query to find the time at which Sam’s salary increased from $5000 to $6000 is very tricky in POSTQUEL.

A last comment is that time travel can be implemented with a conventional transaction system using a write ahead log. For example, one need only have an “archive” constructed type for each physical constructed type for which time travel is desired. When a record is updated, its previous value is written in the archive with the appropriate timestamps. If the transaction fails to commit, this archive insert and the corresponding record update is unwound using a conventional log. Such an implementation may well have substantial benefits, and we should have probably considered such a possibility. In making storage system decisions, we were guided by a missionary zeal to do something different than a conventional write ahead log scheme. Hence, we may have overlooked other intriguing options.

V  The POSTGRES Implementation

V.A  Introduction

POSTGRES contains a fairly conventional parser, query optimizer, and execution engine. Two aspects of the implementation deserve special mention,

dynamic loading and the process structure

choice of implementation language

and we discuss each in turn.

V.B  Dynamic Loading and Process Structure

POSTGRES assumes that data types, operators, and functions can be added and subtracted dynamically, i.e., while the system is executing. Moreover, we have designed the system so that it can accommodate a potentially very large number of types and operators. Consequently, the user functions that support the implementation of a type must be dynamically loaded and unloaded. Hence, POSTGRES maintains a cache of currently loaded functions, dynamically moves functions into the cache, and then ages them out of the cache. Moreover, the parser and optimizer run off of a main memory cache of information about types and operators. Again, this cache must be maintained by POSTGRES software. It would have been much easier to assume that all types and operators were linked into the system at POSTGRES initialization time and have required a user to reinstall POSTGRES when he wished to add or drop types. Moreover, users of prototype software are not running systems which cannot go down for rebooting. Hence, this functionality is not essential.
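The cache behavior described above can be sketched as follows; the class, its size, and the `load` stand-in for dynamic linking are all invented, and a least-recently-used policy is assumed for the aging:

```python
# Hedged sketch of a dynamically loaded function cache: user functions
# are loaded on first use, kept in a bounded cache, and aged out
# least-recently-used when the cache fills.

from collections import OrderedDict

CACHE_SLOTS = 2   # deliberately tiny so the aging is visible

class FunctionCache:
    def __init__(self):
        self.loaded = OrderedDict()   # name -> callable, in LRU order

    def load(self, name):
        """Stand-in for dynamically linking a user-supplied function."""
        return lambda x, _n=name: f"{_n}({x})"

    def call(self, name, arg):
        if name in self.loaded:
            self.loaded.move_to_end(name)        # mark recently used
        else:
            if len(self.loaded) == CACHE_SLOTS:
                self.loaded.popitem(last=False)  # age out the LRU function
            self.loaded[name] = self.load(name)
        return self.loaded[name](arg)

cache = FunctionCache()
cache.call("area", 1)
cache.call("perimeter", 2)
cache.call("area", 3)      # "area" becomes most recently used
cache.call("center", 4)    # cache full: "perimeter" is aged out
```

The static alternative mentioned above would replace `load` with a link-time symbol table and drop the eviction logic entirely.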

Second, the rules system forces significant complexity on the design. A user can add a rule such as

always retrieve (EMP.salary)

where EMP.name = “Joe”

In this case, his application process wishes to be notified of any salary adjustment to Joe. Consider a second user who gives Joe a raise. The POSTGRES process that actually does the adjustment will notice that a marker has been placed on the salary field. However, in order to alert the first user, one of four things must happen.

1.  POSTGRES could be designed as a single server process. In this case, within the current process the first user’s query could simply be activated. However, such a design is incompatible with running on a shared memory multiprocessor, where a so-called multiserver is required. Hence, this design was discarded.

2.  The POSTGRES process for the second user could run the first user’s query and then connect to his application process to deliver results. This requires that an application process be coded to expect communication from random other processes. We felt this was too difficult to be a reasonable solution.

3. The POSTGRES process for the second user could connect to the input socket for the first user’s POSTGRES and deliver the query to be run. The first POSTGRES would run the query and then send results to the user. This would require careful synchronization of the input socket among multiple independent command streams. Moreover, it would require the second POSTGRES to know the portal name on which the first user’s rule was running.

4.  The POSTGRES process for the second user could alert a special process called the POSTMASTER. This process would in turn alert the process for the first user where the query would be run and the results delivered to the application process.

We have adopted the fourth design as the only one we thought was practical. However, we have thereby constructed a process through which everybody must channel communications. If the POSTMASTER crashes, then the whole POSTGRES environment must be restarted. This is a handicap, but we could think of no better solution. Moreover, there is a collection of system demons, including the vacuum cleaner mentioned above, which need a place to run. In POSTGRES, they are run as subprocesses managed by the POSTMASTER.
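Design 4 can be sketched as follows, with every name invented: backends never contact each other directly; the updating backend tells the POSTMASTER which marker fired, and the POSTMASTER wakes the backend holding the corresponding "always" rule, which reruns the query and delivers results to its own application process:

```python
# Sketch (invented names) of the POSTMASTER alerting path for
# "always" rules: per-user backends route all rule notifications
# through a single special process.

class Postmaster:
    def __init__(self):
        self.subscribers = {}   # marker -> backend holding the "always" rule

    def register(self, marker, backend):
        self.subscribers[marker] = backend

    def alert(self, marker):
        backend = self.subscribers.get(marker)
        if backend:
            backend.rerun_rule()

class Backend:
    """One POSTGRES process serving one application process."""
    def __init__(self, postmaster):
        self.postmaster = postmaster
        self.delivered = []     # results handed to the application process

    def rerun_rule(self):
        # Rerun "retrieve (EMP.salary) where EMP.name = 'Joe'" and
        # deliver the result to this backend's own application.
        self.delivered.append("Joe's new salary")

    def update_salary(self, marker):
        # The updating backend notices the marker on the salary field
        # and alerts the POSTMASTER rather than any peer backend.
        self.postmaster.alert(marker)

pm = Postmaster()
watcher = Backend(pm)
pm.register("EMP.salary:Joe", watcher)

updater = Backend(pm)
updater.update_salary("EMP.salary:Joe")
```

The single-point-of-failure noted above is visible here: every notification flows through `pm`, so losing it severs all rule delivery.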

A last aspect of our design concerns the operating system process structure. Currently, POSTGRES runs as one process for each active user. This was done as an expedient to get a system operational as quickly as possible. We plan on converting POSTGRES to use lightweight processes available in the operating systems we are using. These include PRESTO for the Sequent Symmetry and threads in Version 4 of Sun/OS.

V.C  Programming Language Used

At the beginning of the project, we were forced to make a commitment to a programming language and machine environment. The machine was an easy one, since SUN workstations were nearly omnipresent at Berkeley, and any other choice would have been nonstandard. However, we were free to choose any language in which to program. We considered the following:

C

C++

MODULA 2+

LISP

ADA

SMALLTALK.

We dismissed SMALLTALK quickly because we felt it was too slow and compilers were not readily available for a wide variety of platforms. We felt it desirable to keep open the option of distributing our software widely. We felt ADA and MODULA 2+ offered limited advantages over C++ and were not widely used in the Berkeley environment. Hence, obtaining pretrained programmers would have been a problem. Lastly, we were not thrilled to use C, since INGRES had been coded in C and we were anxious to choose a different language, if only for the sake of doing something different. At the time we started (10/85), there was not a stable C++ compiler, so we did not seriously consider this option.

By a process of elimination, we decided to try writing POSTGRES in LISP. We expected that it would be especially easy to write the optimizer and inference engine in LISP, since both are mostly tree processing modules. Moreover, we were seduced by AI claims of high programmer productivity for applications written in LISP.

We soon realized that parts of the system were more easily coded in C, for example the buffer manager, which moves 8K pages back and forth to the disk and uses a modified LRU algorithm to control which pages are resident. Hence, we adopted the policy that we would use both C and LISP and code modules of POSTGRES in whichever language was most appropriate. By the time Version 1 was operational, it contained about 17K lines of LISP and about 63K lines of C.

Our feeling is that the use of LISP has been a terrible mistake for several reasons. First, current LISP environments are very large. To run a “nothing” program in LISP requires about 3 mbytes of address space. Hence, POSTGRES exceeds 4 mbytes in size, all but 1 mbyte of which is the LISP compiler, editor, and assorted other nonrequired (or even desired) functions. Hence, we suffer from a gigantic footprint. Second, a DBMS never wants to stop when garbage collection happens. Any response-time-sensitive program must therefore allocate and deallocate space manually, so that garbage collection never happens during normal processing. Consequently, we spent extra effort ensuring that LISP garbage collection is not used by POSTGRES. Hence, this aspect of LISP, which improves programmer productivity, was not available to us. Third, LISP execution is slow. As noted in the performance figures in the next section, our LISP code is more than twice as slow as the comparable C code. Of course, it is possible that we are not skilled LISP programmers or do not know how to optimize the language; hence, our experience should be suitably discounted.

However, none of these irritants was the real disaster. We have found that debugging a two-language system is extremely difficult. The C debugger, of course, knows nothing about LISP while the LISP debugger knows nothing about C. As a result, we have found debugging POSTGRES to be a painful and frustrating task. Memory allocation bugs were among the most painful since LISP and C have very different models of dynamic memory. Of course, it is true that the optimizer and inference engine were easier to code in LISP. Hence, we saved some time there. However, this was more than compensated by the requirement of writing a lot of utility code that would convert LISP data structures into C and vice versa. In fact, our assessment is that the primary productivity increases in LISP come from the nice programming environment (e.g., interactive debugger, nice workstation tools, etc.) and not from the language itself. Hence, we would encourage the implementors of other programming languages to study the LISP environment carefully and implement the better ideas.

As a result we have just finished moving our 17K lines of LISP to C to avoid the debugging hassle and secondarily to avoid the performance and footprint problems in LISP. Our experience with LISP and two-language systems has not been positive, and we would caution others not to follow in our footsteps.

VI  Status and Performance

At the current time (October 1989) the LISP-less Version 1 of POSTGRES has been in the hands of users for a short time, and we are shaking the last bugs out of the C port. In addition, we have designed all of the additional functionality to appear in Version 2. The characteristics of Version 1 are the following.

1) The query language POSTQUEL runs except for aggregates, functions, and set operators.

2) All object management capabilities are operational except POSTQUEL types.

3) Some support for rules exists. Specifically, replace always commands are operational; however, the implementation currently only supports early evaluation and only with markers on whole columns.

4) The storage system is complete. However, we are taking delivery shortly on an optical disk jukebox, and so the archive is currently not implemented on a real optical disk. Moreover, R-trees to support time travel are not yet implemented.

5) Transaction management runs.

The focus has been on getting the function in POSTGRES to run. So far, only minimal attention has been paid to performance. Figure 1 shows assorted queries in the Wisconsin benchmark and gives results for three systems running on a Sun 3/280. All numbers are run on a nonquiescent system, so there may be significant fluctuations. The first two are the C and LISP versions of POSTGRES. These are functionally identical systems with the same algorithms embodied in the code. The footprint of the LISP system is about 4.5 megabytes while the C system is about 1 megabyte. For comparison purposes, we also include the performance numbers for the commercial version of INGRES in the third column. As can be seen, the LISP system is several times slower than the C system. In various other benchmarks, we have never seen the C system less than twice as fast as the LISP system. Moreover, the C system is several times slower than a commercial system. The public domain version of INGRES that we worked on in the mid-1970s is about a factor of two slower than commercial INGRES. Hence, it appears that POSTGRES is about one-half the speed of the original INGRES. There are substantial inefficiencies in POSTGRES, especially in the code which checks that a retrieved record is valid. We expect that subsequent tuning will get us somewhere in between the performance of public domain INGRES and RTI INGRES.

Figure 1  Comparison of INGRES and POSTGRES (times are listed in seconds per query).

VII  Conclusions

In this section, we summarize our opinions about certain aspects of the design of POSTGRES. First, we are uneasy about the complexity of the POSTGRES data model. The comments in Section II all contain suggestions to make it more complex. Moreover, other research teams have tended to construct even more complex data models, e.g., EXTRA [9]. Consequently, a simple concept such as referential integrity, which can be done in only one way in existing commercial systems, can be done in several different ways in POSTGRES. For example, the user can implement an abstract data type and then do the required checking in the input conversion routine. Alternately, he can use a rule in the POSTGRES rules system. Lastly, he can use a POSTQUEL function for the field that corresponds to the foreign key in a current relational system. There are complex performance tradeoffs between these three solutions, and a decision must be made by a sophisticated application designer. We fear that real users, who have a hard time with database design for existing relational systems, will find the next-generation data models, such as the one in POSTGRES, impossibly complex. The problem is that applications exist where each representation is the only acceptable one. The demand for wider application of database technology ensures that vendors will produce systems with these more complex data models.

Another source of uneasiness is the fact that rules and POSTQUEL functions have substantial overlap in function. For example, a POSTQUEL function can be simulated by one rule per record, albeit at some performance penalty. On the other hand, all rules, except retrieve always commands, can be alternately implemented using POSTQUEL functions. We expect to merge the two concepts in Version 2, and our proposal appears in [32].

In the areas of rules and storage management, we are basically satisfied with POSTGRES capabilities. The syntax of the rule system should be changed as noted in Section III; however, this is not a significant issue and it should be available easily in Version 2. The storage manager has been quite simple to implement. Crash recovery code has been easy to write because the only routine which must be carefully written is the vacuum cleaner. Moreover, access to past history seems to be a highly desirable capability.

Furthermore, the POSTGRES implementation certainly erred in the direction of excessive sophistication. For example, new types and functions can be added on-the-fly without recompiling POSTGRES. It would have been much simpler to construct a system that required recompilation to add a new type. Second, we have implemented a complete transaction system in Version 1. Other prototypes tend to assume a single user environment. In these and many other ways, we strove for substantial generality; however, the net effect has been to slow down the implementation effort and make the POSTGRES internals much more complex. As a result, POSTGRES has taken us considerably longer to build than the original version of INGRES. One could call this the “second system” effect. It was essential that POSTGRES be more usable than the original INGRES prototype in order for us to feel like we were making a contribution.

最后一个评论涉及向商业系统的技术转移。看来这个过程正在大大加速。例如,关系模型是在 1970 年构建的,第一个实现原型在 1976-1977 年左右出现,商业版本在 1981 年左右首次出现,关系系统在市场上的流行发生在 1985 年左右。因此,有 15 年的时间段。想法被转移到商业系统中。POSTGRES 和其他下一代系统中的大多数想法都可以追溯到 1984 年或更晚。体现其中一些想法的商业系统已经出现,预计主要供应商将在未来一两年内拥有先进的系统。因此,15 年的期限似乎已缩短至不足一半。这种加速令人印象深刻,但它将导致当前原型系列的寿命相当短。

A last comment concerns technology transfer to commercial systems. It appears that the process is substantially accelerating. For example, the relational model was constructed in 1970, first prototypes of implementations appeared around 1976–1977, commercial versions first surfaced around 1981 and popularity of relational systems in the marketplace occurred around 1985. Hence, there was a 15 year period during which the ideas were transferred to commercial systems. Most of the ideas in POSTGRES and in other next-generation systems date from 1984 or later. Commercial systems embodying some of these ideas have already appeared and major vendors are expected to have advanced systems within the next year or two. Hence, the 15 year period appears to have shrunk to less than half that amount. This acceleration is impressive, but it will lead to rather short lifetimes for the current collection of prototypes.

References

[1]  R. Agrawal and N. Gehani, “ODE: The language and the data model,” in Proc. 1989 ACM SIGMOD Conf. Management Data, Portland, OR, May 1989.

[2]  M. Atkinson et al., “The object-oriented database system manifesto,” Altair Tech. Rep. 30-89, Rocquencourt, France, Aug. 1989.

[3]  Anon et al., “A measure of transaction processing power,” Tandem Computers, Tech. Rep. 85.1, Cupertino, CA, 1985.

[4]  P. Aoki, “Implementation of extended indexes in POSTGRES,” Electron. Res. Lab., Univ. of California, Tech. Rep. 89-62, July 1989.

[5]  F. Bancilhon and R. Ramakrishnan, “An amateur’s introduction to recursive query processing,” in Proc. 1986 ACM SIGMOD Conf. Management Data, Washington, DC, May 1986.

[6]  J. Banerjee et al., “Semantics and implementation of schema evolution in object-oriented databases,” in Proc. 1987 ACM SIGMOD Conf. Management Data, San Francisco, CA, May 1987.

[7]  D. Bitton et al., “Benchmarking database systems: A systematic approach,” in Proc. 1983 VLDB Conf., Cannes, France, Sept. 1983.

[8]  A. Borgida, “Language features for flexible handling of exceptions in information systems,” ACM TODS, Dec. 1985.

[9]  M. Carey et al., “A data model and query language for EXODUS,” in Proc. 1988 ACM SIGMOD Conf. Management Data, Chicago, IL, June 1988.

[10]  G. Copeland and D. Maier, “Making Smalltalk a database system,” in Proc. 1984 ACM SIGMOD Conf. Management Data, Boston, MA, June 1984.

[11]  P. Dadam et al., “A DBMS prototype to support NF2 relations,” in Proc. 1986 ACM SIGMOD Conf. Management Data, Washington, DC, May 1986.

[12]  C. Date, “Referential integrity,” in Proc. Seventh Int. VLDB Conf., Cannes, France, Sept. 1981.

[13]  K. Eswaren, “Specification, implementation and interactions of a rule subsystem in an integrated database system,” IBM Res., San Jose, CA, Res. Rep. RJ1820, Aug. 1976.

[14]  C. Faloutsos et al., “Analysis of object oriented spatial access methods,” in Proc. 1987 ACM SIGMOD Conf. Management Data, San Francisco, CA, May 1987.

[15]  A. Gutman, “R-trees: A dynamic index structure for spatial searching,” in Proc. 1984 ACM SIGMOD Conf. Management Data, Boston, MA, June 1984.

[16]  C. Kolovson and M. Stonebraker, “Segmented search trees and their application to data bases,” in preparation.

[17]  C. Lynch and M. Stonebraker, “Extended user-defined indexing with application to textual databases,” in Proc. 1988 VLDB Conf., Los Angeles, CA, Sept. 1988.

[18]  D. Maier, “Why isn’t there an object-oriented data model?” in Proc. 11th IFIP World Congress, San Francisco, CA, Aug. 1989.

[19]  S. Osborne and T. Heaven, “The design of a relational system with abstract data types as domains,” ACM TODS, Sept. 1986.

[20]  J. Richardson and M. Carey, “Programming constructs for database system implementation in EXODUS,” in Proc. 1987 ACM SIGMOD Conf. Management Data, San Francisco, CA, May 1987.

[21]  L. Rowe and M. Stonebraker, “The POSTGRES data model,” in Proc. 1987 VLDB Conf., Brighton, England, Sept. 1987.

[22]  L. Rowe et al., “The design and implementation of Picasso,” in preparation.

[23]  T. Sellis, “Global query optimization,” in Proc. 1986 ACM SIGMOD Conf. Management Data, Washington, DC, June 1986.

[24]  M. Stonebraker, “Implementation of integrity constraints and views by query modification,” in Proc. 1975 ACM SIGMOD Conf., San Jose, CA, May 1975.

[25]  M. Stonebraker et al., “A rules system for a relational data base management system,” in Proc. 2nd Int. Conf. Databases, Jerusalem, Israel, June 1982. New York: Academic.

[26]  M. Stonebraker and L. Rowe, “The design of POSTGRES,” in Proc. 1986 ACM-SIGMOD Conf., Washington, DC, June 1986.

[27]  M. Stonebraker, “Inclusion of new types in relational data base systems,” in Proc. Second Int. Conf. Data Eng., Los Angeles, CA, Feb. 1986.

[28]  ———, “The POSTGRES storage system,” in Proc. 1987 VLDB Conf., Brighton, England, Sept. 1987.

[29]  M. Stonebraker et al., “Extensibility in POSTGRES,” IEEE Database Eng., Sept. 1987.

[30]  M. Stonebraker et al., “The POSTGRES rules system,” IEEE Trans. Software Eng., July 1988.

[31]  M. Stonebraker et al., “Commentary on the POSTGRES rules system,” SIGMOD Rec., Sept. 1989.

[32]  M. Stonebraker et al., “Rules, procedures and views,” in preparation.

[33]  J. Ullman, “Implementation of logical query languages for databases,” ACM TODS, Sept. 1985.

[34]  F. Velez et al., “The O2 object manager: An overview,” GIP ALTAIR, Le Chesnay, France, Tech. Rep. 27-89, Feb. 1989.

[35]  S. Wensel, Ed., “The POSTGRES reference manual,” Electron. Res. Lab., Univ. of California, Berkeley, CA, Rep. M88/20, Mar. 1988.

[36]  J. Widom and S. Finkelstein, “A syntax and semantics for set-oriented production rules in relational data bases,” IBM Res., San Jose, CA, June 1989.

Michael Stonebraker, for a photograph and biography, see this issue, p. 3.

Lawrence A. Rowe received the B.A. degree in mathematics and the Ph.D. degree in information and computer science from the University of California, Irvine, in 1970 and 1976, respectively.

Since 1976, he has been on the faculty at the University of California, Berkeley. His primary research interests are database application development tools and integrated circuit (IC) computer-integrated manufacturing. He designed and implemented Rigel, a Pascal-like database programming language, and the Forms Application Development System, a forms-based 4GL with integrated application generators. He is currently implementing an object-oriented, graphical user-interface development system called Picasso and a programming language to specify IC fabrication process plans called the Berkeley Process-Flow Language. He is a co-founder and director of Ingres, Inc., which markets the INGRES relational database management system.

Michael Hirohama received the B.A. degree in computer science from the University of California, Berkeley, in 1987.

Since that time, he has been the lead programmer for the POSTGRES project.

Manuscript received August 15, 1989; revised December 1, 1989. This work was supported by the Defense Advanced Research Projects Agency under NASA Grant NAG 2-530 and by the Army Research Office under Grant DAAL03-87-K-0083.

The authors are with the Department of Electrical Engineering and Computer Science, University of California, Berkeley, CA 94720.

IEEE Log Number 8933788.

Paper originally published in IEEE Transactions on Knowledge and Data Engineering, 2(1): 125–142, 1990. Original DOI: 10.1109/69.50912

1. In this section, the reader can use the words constructed type, relation, and class interchangeably. Moreover, the words record, instance, and tuple are similarly interchangeable. This section has been purposely written with the chosen notation to illustrate a point about object-oriented databases which is discussed in Section II-E.

The Design and Implementation of INGRES

Michael Stonebraker (University of California, Berkeley), Eugene Wong (University of California, Berkeley), Peter Kreps (University of California, Berkeley), Gerald Held (Tandem Computers, Inc.)

The currently operational (March 1976) version of the INGRES database management system is described. This multiuser system gives a relational view of data, supports two high level nonprocedural data sublanguages, and runs as a collection of user processes on top of the UNIX operating system for Digital Equipment Corporation PDP11/40, 11/45, and 11/70 computers. Emphasis is on the design decisions and tradeoffs related to (1) structuring the system into processes, (2) embedding one command language in a general purpose programming language, (3) the algorithms implemented to process interactions, (4) the access methods implemented, (5) the concurrency and recovery control currently provided, and (6) the data structures used for system catalogs and the role of the database administrator.

Also discussed are (1) support for integrity constraints (which is only partly operational), (2) the not yet supported features concerning views and protection, and (3) future plans concerning the system.

Key Words and Phrases: relational database, nonprocedural language, query language, data sublanguage, data organization, query decomposition, database optimization, data integrity, protection, concurrency

CR Categories: 3.50, 3.70, 4.22, 4.33, 4.34

1  Introduction

INGRES (Interactive Graphics and Retrieval System) is a relational database system which is implemented on top of the UNIX operating system developed at Bell Telephone Laboratories [22] for Digital Equipment Corporation PDP 11/40, 11/45, and 11/70 computer systems. The implementation of INGRES is primarily programmed in C, a high level language in which UNIX itself is written. Parsing is done with the assistance of YACC, a compiler-compiler available on UNIX [19].

The advantages of a relational model for database management systems have been extensively discussed in the literature [7, 10, 11] and hardly require further elaboration. In choosing the relational model, we were particularly motivated by (a) the high degree of data independence that such a model affords, and (b) the possibility of providing a high level and entirely procedure free facility for data definition, retrieval, update, access control, support of views, and integrity verification.

1.1  Aspects Described in This Paper

In this paper we describe the design decisions made in INGRES. In particular we stress the design and implementation of: (a) the system process structure (see Section 2 for a discussion of this UNIX notion); (b) the embedding of all INGRES commands in the general purpose programming language C; (c) the access methods implemented; (d) the catalog structure and the role of the database administrator; (e) support for views, protection, and integrity constraints; (f) the decomposition procedure implemented; (g) implementation of updates and consistency of secondary indices; (h) recovery and concurrency control.

In Section 1.2 we briefly describe the primary query language supported, QUEL, and the utility commands accepted by the current system. The second user interface, CUPID, is a graphics oriented, casual user language which is also operational [20, 21] but not discussed in this paper. In Section 1.3 we describe the EQUEL (Embedded QUEL) precompiler, which allows the substitution of a user supplied C program for the “front end” process. This precompiler has the effect of embedding all of INGRES in the general purpose programming language C. In Section 1.4 a few comments on QUEL and EQUEL are given.

In Section 2 we describe the relevant factors in the UNIX environment which have affected our design decisions. Moreover, we indicate the structure of the four processes into which INGRES is divided and the reasoning behind the choices implemented.

In Section 3 we indicate the catalog (system) relations which exist and the role of the database administrator with respect to all relations in a database. The implemented access methods, their calling conventions, and, where appropriate, the actual layout of data pages in secondary storage are also presented.

Sections 4, 5, and 6 discuss respectively the various functions of each of the three “core” processes in the system. Also discussed are the design and implementation strategy of each process. Finally, Section 7 draws conclusions, suggests future extensions, and indicates the nature of the current applications run on INGRES.

Except where noted to the contrary, this paper describes the INGRES system operational in March 1976.

1.2  QUEL and the Other INGRES Utility Commands

QUEL (QUEry Language) has points in common with Data Language/ALPHA [8], SQUARE [3], and SEQUEL [4] in that it is a complete query language which frees the programmer from concern for how data structures are implemented and what algorithms are operating on stored data [9]. As such it facilitates a considerable degree of data independence [24].

The QUEL examples in this section all concern the following relations.

EMPLOYEE (NAME, DEPT, SALARY, MANAGER, AGE)

DEPT   (DEPT, FLOOR#)

A QUEL interaction includes at least one RANGE statement of the form

RANGE OF variable-list IS relation-name

The purpose of this statement is to specify the relation over which each variable ranges. The variable-list portion of a RANGE statement declares variables which will be used as arguments for tuples. These are called tuple variables.

An interaction also includes one or more statements of the form

Command  [result-name] (target-list)

[WHERE Qualification]

Here Command is either RETRIEVE, APPEND, REPLACE, or DELETE. For RETRIEVE and APPEND, result-name is the name of the relation which qualifying tuples will be retrieved into or appended to. For REPLACE and DELETE, result-name is the name of a tuple variable which, through the qualification, identifies tuples to be modified or deleted. The target-list is a list of the form

result-domain = QUEL function, …

Here the result-domains are domain names in the result relation which are to be assigned the values of the corresponding functions.

The following suggest valid QUEL interactions. A complete description of the language is presented in [15].

Example 1  Compute salary divided by age-18 for employee Jones.

RANGE OF E IS EMPLOYEE

RETRIEVE INTO W

(COMP = E.SALARY/(E.AGE-18))

WHERE E.NAME = “Jones”

Here E is a tuple variable which ranges over the EMPLOYEE relation, and all tuples in that relation are found which satisfy the qualification E.NAME = “Jones.” The result of the query is a new relation W, which has a single domain COMP that has been calculated for each qualifying tuple.

If the result relation is omitted, qualifying tuples are written in display format on the user’s terminal or returned to a calling program.

Example 2  Insert the tuple (Jackson,candy,13000,Baker,30) into EMPLOYEE.

APPEND TO EMPLOYEE(NAME = “Jackson”, DEPT = “candy”,

SALARY = 13000, MGR = “Baker”, AGE = 30)

Here the result relation EMPLOYEE is modified by adding the indicated tuple to the relation. Domains which are not specified default to zero for numeric domains and null for character strings. A shortcoming of the current implementation is that 0 is not distinguished from “no value” for numeric domains.

Example 3  Fire everybody on the first floor.

RANGE OF E IS EMPLOYEE

RANGE OF D IS DEPT

DELETE E WHERE E.DEPT = D.DEPT

       AND D.FLOOR# = 1

Here E specifies that the EMPLOYEE relation is to be modified. All tuples are to be removed which have a value for DEPT which is the same as some department on the first floor.

Example 4  Give a 10-percent raise to Jones if he works on the first floor.

RANGE OF E IS EMPLOYEE

RANGE OF D IS DEPT

REPLACE E(SALARY = 1.1*E.SALARY)

WHERE E.NAME = “Jones” AND

      E.DEPT = D.DEPT AND D.FLOOR# = 1

Here E.SALARY is to be replaced by 1.1*E.SALARY for those tuples in EMPLOYEE where the qualification is true.

In addition to the above QUEL commands, INGRES supports a variety of utility commands. These utility commands can be classified into seven major categories.

(a) Invocation of INGRES:

INGRES data-base-name

This command executed from UNIX “logs in” a user to a given database. (A database is simply a named collection of relations with a given database administrator who has powers not available to ordinary users.) Thereafter the user may issue all other commands (except those executed directly from UNIX) within the environment of the invoked database.

(b) Creation and destruction of databases:

CREATEDB data-base-name

DESTROYDB data-base-name

These two commands are called from UNIX. The invoker of CREATEDB must be authorized to create databases (in a manner to be described presently), and he automatically becomes the database administrator. DESTROYDB successfully destroys a database only if invoked by the database administrator.

(c) Creation and destruction of relations:

CREATE relname(domain-name IS format, domain-name IS format, …)

DESTROY relname

These commands create and destroy relations within the current database. The invoker of the CREATE command becomes the “owner” of the relation created. A user may only destroy a relation that he owns. The current formats accepted by INGRES are 1-, 2-, and 4-byte integers, 4- and 8-byte floating point numbers, and 1- to 255-byte fixed length ASCII character strings.

(d) Bulk copy of data:

COPY relname(domain-name IS format, domain-name IS format, …)

 direction “file-name”

PRINT relname

The command COPY transfers an entire relation to or from a UNIX file whose name is “filename.” Direction is either TO or FROM. The format for each domain is a description of how it appears (or is to appear) in the UNIX file. The relation relname must exist and have domain names identical to the ones appearing in the COPY command. However, the formats need not agree and COPY will automatically convert data types. Support is also provided for dummy and variable length fields in a UNIX file.

PRINT copies a relation onto the user’s terminal, formatting it as a report. In this sense it is a stylized version of COPY.
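To make this concrete, a hypothetical COPY invocation might bulk-load EMPLOYEE from a text file. The format codes (c20, i4, and so on) and the file path below are illustrative assumptions, not taken from the reference manual:

```
COPY EMPLOYEE(NAME IS c20, DEPT IS c10, SALARY IS i4, MANAGER IS c20, AGE IS i2)
 FROM “/tmp/employee.data”
```

COPY would convert each field from its file format to the format declared for the corresponding domain when EMPLOYEE was created.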

(e) Storage structure modification:

MODIFY relname TO storage-structure ON (key1, key2, …)

INDEX ON relname IS indexname(key1, key2, …)

The MODIFY command changes the storage structure of a relation from one access method to another. The five access methods currently supported are discussed in Section 3. The indicated keys are domains in relname which are concatenated left to right to form a combined key which is used in the organization of tuples in all but one of the access methods. Only the owner of a relation may modify its storage structure.

INDEX creates a secondary index for a relation. It has domains of key1, key2, …, pointer. The domain “pointer” is the unique identifier of a tuple in the indexed relation having the given values for key1, key2, …. An index named AGEINDEX for EMPLOYEE might be the following binary relation (assuming that there are six tuples in EMPLOYEE with appropriate names and ages).
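For instance, one might reorganize EMPLOYEE and then build the age index; the storage-structure name “hash” below stands in for whichever of the five access methods of Section 3 is intended:

```
MODIFY EMPLOYEE TO hash ON (NAME)

INDEX ON EMPLOYEE IS AGEINDEX(AGE)
```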

AGEINDEX

Age    Pointer
 25    identifier for Smith’s tuple
 32    identifier for Jones’s tuple
 36    identifier for Adams’s tuple
 29    identifier for Johnson’s tuple
 47    identifier for Baker’s tuple
 58    identifier for Harding’s tuple

The relation indexname is in turn treated and accessed just like any other relation, except it is automatically updated when the relation it indexes is updated. Naturally, only the owner of a relation may create and destroy secondary indexes for it.

(f) Consistency and integrity control:

INTEGRITY CONSTRAINT is qualification

INTEGRITY CONSTRAINT LIST relname

INTEGRITY CONSTRAINT OFF relname

INTEGRITY CONSTRAINT OFF (integer, …, integer)

RESTORE data-base-name

The first four commands support the insertion, listing, deletion, and selective deletion of integrity constraints which are to be enforced for all interactions with a relation. The mechanism for handling this enforcement is discussed in Section 4. The last command restores a database to a consistent state after a system crash. It must be executed from UNIX, and its operation is discussed in Section 6. The RESTORE command is only available to the database administrator.
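As a sketch of the first command (the exact syntax is given in the reference manual [31]), a constraint keeping salaries nonnegative might read:

```
RANGE OF E IS EMPLOYEE

INTEGRITY CONSTRAINT IS E.SALARY >= 0
```

Section 4 discusses how such a qualification is enforced on every interaction with EMPLOYEE.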

(g) Miscellaneous:

HELP [relname or manual-section]

SAVE relname UNTIL expiration-date

PURGE data-base-name

HELP provides information about the system or the database invoked. When called with an optional argument which is a command name, HELP returns the appropriate page from the INGRES reference manual [31]. When called with a relation name as an argument, it returns all information about that relation. With no argument at all, it returns information about all relations in the current database.

SAVE is the mechanism by which a user can declare his intention to keep a relation until a specified time. PURGE is a UNIX command which can be invoked by a database administrator to delete all relations whose “expiration-dates” have passed. This should be done when space in a database is exhausted. (The database administrator can also remove any relations from his database using the DESTROY command, regardless of who their owners are.)

Two comments should be noted at this time.

(a) The system currently accepts the language specified as QUEL1 in [15]; extension is in progress to accept QUELn. (b) The system currently does not accept views or protection statements. Although the algorithms have been specified [25, 27], they are not yet operational. For this reason no syntax for these statements is given in this section; however the subject is discussed further in Section 4.

1.3  EQUEL

Although QUEL alone provides the flexibility for many data management requirements, there are applications which require a customized user interface in place of the QUEL language. For this as well as other reasons, it is often useful to have the flexibility of a general purpose programming language in addition to the database facilities of QUEL. To this end, a new language, EQUEL (Embedded QUEL), which consists of QUEL embedded in the general purpose programming language C, has been implemented.

In the design of EQUEL the following goals were set: (a) The new language must have the full capabilities of both C and QUEL. (b) The C program should have the capability for processing each tuple individually, thereby satisfying the qualification in a RETRIEVE statement. (This is the “piped” return facility described in Data Language/ALPHA [8].)

With these goals in mind, EQUEL was defined as follows:

(a)  Any C language statement is a valid EQUEL statement.

(b)  Any QUEL statement (or INGRES utility command) is a valid EQUEL statement as long as it is prefixed by two number signs (##).

(c)  C program variables may be used anywhere in QUEL statements except as command names. The declaration statements of C variables used in this manner must also be prefixed by double number signs.

(d)  RETRIEVE statements without a result relation have the form

RETRIEVE (target-list)

     [WHERE qualification]

     {
         C-block
     }

which results in the C-block being executed once for each qualifying tuple.

Two short examples illustrate EQUEL syntax.

Example 5  The following program implements a small front end to INGRES which performs only one query. It reads in the name of an employee and prints out the employee’s salary in a suitable format. It continues to do this as long as there are names to be read in. The functions READ and PRINT have the obvious meaning.
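The program image for this example did not survive extraction; the following is a hedged sketch of what such an EQUEL front end might look like under the conventions just given (QUEL statements and declarations of C variables used in QUEL prefixed by ##). The relation name EMPLOYEE and its domains NAME and SALARY are assumptions for illustration, not taken from the text.

```
##  char EMPNAME[20];
##  int  SAL;

    while (READ(EMPNAME))
##  {
##      RANGE OF E IS EMPLOYEE
##      RETRIEVE (SAL = E.SALARY)
##          WHERE E.NAME = EMPNAME
##      {
            PRINT(EMPNAME, SAL);
##      }
##  }
```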

In this example the C variable EMPNAME is used in the qualification of the QUEL statement, and for each qualifying tuple the C variable SAL is set to the appropriate value and then the PRINT statement is executed.

Example 6  Read in a relation name and two domain names. Then for each of a collection of values which the second domain is to assume, do some processing on all values which the first domain assumes. (We assume the function PROCESS exists and has the obvious meaning.) A more elaborate version of this program could serve as a simple report generator.
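The program image for this example is also missing; a hedged sketch follows, using the variable names RELNAME, DOMNAME, and DOMNAME2 that the commentary below refers to. The variables VALUE and D are invented for the sketch.

```
##  char RELNAME[13], DOMNAME[13], DOMNAME2[13];
##  char VALUE[80], D[80];

    READ(RELNAME);
    READ(DOMNAME);
    READ(DOMNAME2);
##  RANGE OF X IS RELNAME
    while (READ(VALUE))
##  {
##      RETRIEVE (D = X.DOMNAME)
##          WHERE X.DOMNAME2 = VALUE
##      {
            PROCESS(D);
##      }
##  }
```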

Any RANGE declaration (in this case the one for X) is assumed by INGRES to hold until redefined. Hence only one RANGE statement is required, regardless of the number of times the RETRIEVE statement is executed. Note clearly that anything except the name of an INGRES command can be a C variable. In the above example RELNAME is a C variable used as a relation name, while DOMNAME and DOMNAME2 are used as domain names.

1.4  Comments on QUEL and EQUEL

In this section a few remarks are made indicating differences between QUEL and EQUEL and selected other proposed data sublanguages and embedded data sublanguages.

QUEL borrows much from Data Language/ALPHA. The primary differences are: (a) Arithmetic is provided in QUEL; Data Language/ALPHA suggests reliance on a host language for this feature. (b) No quantifiers are present in QUEL. This results in a consistent semantic interpretation of the language in terms of functions on the crossproduct of the relations declared in the RANGE statements. Hence, QUEL is considered by its designers to be a language based on functions and not on a first order predicate calculus. (c) More powerful aggregation capabilities are provided in QUEL.

The latest version of SEQUEL [2] has grown rather close to QUEL. The reader is directed to Example 1(b) of [2], which suggests a variant of the QUEL syntax. The main differences between QUEL and SEQUEL appear to be: (a) SEQUEL allows statements with no tuple variables when possible using a block oriented notation. (b) The aggregation facilities of SEQUEL appear to be different from those defined in QUEL.

System R [2] contains a proposed interface between SEQUEL and PL/1 or other host language. This interface differs substantially from EQUEL and contains explicit cursors and variable binding. Both notions are implicit in EQUEL. The interested reader should contrast the two different approaches to providing an embedded data sublanguage.

2  The INGRES Process Structure

INGRES can be invoked in two ways: First, it can be directly invoked from UNIX by executing INGRES database-name; second, it can be invoked by executing a program written using the EQUEL precompiler. We discuss each in turn and then comment briefly on why two mechanisms exist. Before proceeding, however, a few details concerning UNIX must be introduced.

2.1  The UNIX Environment

Two points concerning UNIX are worthy of mention in this section.

(a) The UNIX file system. UNIX supports a tree structured file system similar to that of MULTICS. Each file is either a directory (containing references to descendant files in the file system) or a data file. Each file is divided physically into 512-byte blocks (pages). In response to a read request, UNIX moves one or more pages from secondary memory to UNIX core buffers and then returns to the user the actual byte string desired. If the same page is referenced again (by the same or another user) while it is still in a core buffer, no disk I/O takes place.

It is important to note that UNIX pages data from the file system into and out of system buffers using a “least recently used” replacement algorithm. In this way the entire file system is managed as a large virtual store.

The INGRES designers believe that a database system should appear as a user job to UNIX. (Otherwise, the system would operate on a nonstandard UNIX and become less portable.) Moreover the designers believe that UNIX should manage the system buffers for the mix of jobs being run. Consequently, INGRES contains no facilities to do its own memory management.

(b) The UNIX process structure. A process in UNIX is an address space (64K bytes or less on an 11/40, 128K bytes or less on an 11/45 or 11/70) which is associated with a user-id and is the unit of work scheduled by the UNIX scheduler. Processes may “fork” subprocesses; consequently a parent process can be the root of a process subtree. Furthermore, a process can request that UNIX execute a file in a descendant process. Such processes may communicate with each other via an interprocess communication facility called “pipes.” A pipe may be declared as a one-direction communication link which is written into by one process and read by a second one. UNIX maintains synchronization of pipes so no messages are lost. Each process has a “standard input device” and a “standard output device.” These are usually the user’s terminal, but may be redirected by the user to be files, pipes to other processes, or other devices.
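The pipe mechanism just described can be sketched in C. This is an illustrative miniature, not INGRES code: a parent process writes a command string down one one-direction pipe to a forked child, and the child answers back up a second pipe, the way commands flow to the right and data and errors flow back to the left between the INGRES processes.

```c
#include <string.h>
#include <unistd.h>
#include <sys/wait.h>

/* Send a command string down a one-direction pipe to a forked child,
 * which echoes it back up a second pipe.  Returns the number of bytes
 * echoed back, or -1 on error.  Purely illustrative. */
ssize_t relay_through_pipe(const char *cmd, char *reply, size_t cap)
{
    int down[2], up[2];                    /* down: parent->child; up: child->parent */

    if (pipe(down) != 0 || pipe(up) != 0)
        return -1;

    if (fork() == 0) {                     /* child: the "process to the right" */
        char buf[256];
        close(down[1]);
        close(up[0]);
        ssize_t n = read(down[0], buf, sizeof buf);
        if (n > 0)
            write(up[1], buf, (size_t)n);  /* echo the command back */
        _exit(0);
    }

    close(down[0]);                        /* parent keeps the write end of "down" */
    close(up[1]);                          /* and the read end of "up" */
    write(down[1], cmd, strlen(cmd));
    close(down[1]);

    ssize_t n = read(up[0], reply, cap - 1);
    close(up[0]);
    wait(NULL);                            /* synchronize with the child, as INGRES does */
    if (n < 0)
        return -1;
    reply[n] = '\0';
    return n;
}
```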

Figure 1  INGRES process structure

Last, UNIX provides a facility for processes executing reentrant code to share procedure segments if possible. INGRES takes advantage of this facility so the core space overhead of multiple concurrent users is only that required by data segments.

2.2  Invocation from UNIX

Issuing INGRES as a UNIX command causes the process structure shown in Figure 1 to be created. In this section the functions in the four processes will be indicated. The justification of this particular structure is given in Section 2.4.

Process 1 is an interactive terminal monitor which allows the user to formulate, print, edit, and execute collections of INGRES commands. It maintains a workspace with which the user interacts until he is satisfied with his interaction. The contents of this workspace are passed down pipe A as a string of ASCII characters when execution is desired. The set of commands accepted by the current terminal monitor is indicated in [31].

As noted above, UNIX allows a user to alter the standard input and output devices for his processes when executing a command. As a result the invoker of INGRES may direct the terminal monitor to take input from a user file (in which case he runs a “canned” collection of interactions) and direct output to another device (such as the line printer) or file.
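With the ordinary UNIX redirection syntax, such a “canned” run might be issued as follows; the database and file names here are invented for illustration.

```
INGRES demodb < canned.interactions > report.out
```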

Process 2 contains a lexical analyzer, a parser, query modification routines for integrity control (and, in the future, support of views and protection), and concurrency control. Because of size constraints, however, the integrity control routines are not in the currently released system. When process 2 finishes, it passes a string of tokens to process 3 through pipe B. Process 2 is discussed in Section 4.

Process 3 accepts this token string and contains execution routines for the commands RETRIEVE, REPLACE, DELETE, and APPEND. Any update is turned into a RETRIEVE command to isolate tuples to be changed. Revised copies of modified tuples are spooled into a special file. This file is then processed by a “deferred update processor” in process 4, which is discussed in Section 6.

Basically, process 3 performs two functions for RETRIEVE commands. (a) A multivariable query is decomposed into a sequence of interactions involving only a single variable. (b) A one-variable query is executed by a one-variable query processor (OVQP). The OVQP in turn performs its function by making calls on the access methods. These two functions are discussed in Section 5; the access methods are indicated in Section 3.
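As an illustration of function (a), consider a two-variable query; the relations EMPLOYEE and DEPT and their domains are invented for this example, and the actual decomposition algorithm is the subject of Section 5 and [30].

```
RANGE OF E IS EMPLOYEE
RANGE OF D IS DEPT
RETRIEVE (E.NAME)
    WHERE E.DEPT = D.DNAME
    AND D.FLOOR = 1
```

One plausible decomposition first executes the one-variable subquery on D (retrieving the DNAME values of first-floor departments into a temporary relation) and then substitutes each resulting value into the remaining qualification, leaving a sequence of one-variable queries on E which OVQP can execute directly.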

All code to support utility commands (CREATE, DESTROY, INDEX, etc.) resides in process 4. Process 3 simply passes to process 4 any commands which process 4 will execute. Process 4 is organized as a collection of overlays which accomplish the various functions. Some of these functions are discussed in Section 6.

Error messages are passed back through pipes D, E, and F to process 1, which returns them to the user. If the command is a RETRIEVE with no result relation specified, process 3 returns qualifying tuples in a stylized format directly to the “standard output device” of process 1. Unless redirected, this is the user’s terminal.

2.3  Invocation from EQUEL

We now turn to the operation of INGRES when invoked by code from the precompiler.

In order to implement EQUEL, a translator (precompiler) was written to convert an EQUEL program into a valid C program with QUEL statements converted to appropriate C code and calls to INGRES. The resulting C program is then compiled by the normal C compiler, producing an executable module. Moreover, when an EQUEL program is run, the executable module produced by the C compiler is used as the front end process in place of the interactive terminal monitor, as noted in Figure 2.

During execution of the front end program, database requests (QUEL statements in the EQUEL program) are passed through pipe A and processed by INGRES. Note that unparsed ASCII strings are passed to process 2; the rationale behind this decision is given in [1]. If tuples must be returned for tuple-at-a-time processing, they are returned through a special data pipe set up between process 3 and the C program. A condition code is also returned through pipe F to indicate success or the type of error encountered.

Figure 2  The forked process structure

The functions performed by the EQUEL translator are discussed in detail in [1].

2.4  Comments on the Process Structure

The process structure shown in Figures 1 and 2 is the fourth different process structure implemented. The following considerations suggested this final choice:

(a) Address space limitations. To run on an 11/40, the 64K address space limitation must be adhered to. Processes 2 and 3 are essentially their maximum size; hence they cannot be combined. The code in process 4 is in several overlays because of size constraints.

Were a large address space available, it is likely that processes 2, 3, and 4 would be combined into a single large process. However, the necessity of 3 “core” processes should not degrade performance substantially for the following reasons.

If one large process were resident in main memory, there would be no necessity of swapping code. However, were enough real memory available (~300K bytes) on a UNIX system to hold processes 2 and 3 and all overlays of process 4, no swapping of code would necessarily take place either. Of course, this option is possible only on an 11/70.

On the other hand, suppose one large process was paged into and out of main memory by an operating system and hardware which supported a virtual memory. It is felt that under such conditions page faults would generate I/O activity at approximately the same rate as the swapping/overlaying of processes in INGRES (assuming the same amount of real memory was available in both cases).

Consequently the only sources of overhead that appear to result from multiple processes are the following: (1) Reading or writing pipes require system calls which are considerably more expensive than subroutine calls (which could be used in a single-process system). There are at least eight such system calls needed to execute an INGRES command. (2) Extra code must be executed to format information for transmission on pipes. For example, one cannot pass a pointer to a data structure through a pipe; one must linearize and pass the whole structure.
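Point (2) can be sketched in C. The token structure below is invented for illustration (it is not the actual INGRES token format); the point is only that the bytes a pointer refers to must travel down the pipe along with the fixed part of the structure, since a raw pointer is meaningless in the receiving process's address space.

```c
#include <string.h>
#include <stddef.h>

/* A structure containing a pointer cannot be sent through a pipe as-is. */
struct token { int type; const char *text; };

/* Linearize: copy the fixed part, then the pointed-to string, into buf.
 * Returns the number of bytes that must be written to the pipe. */
size_t flatten_token(const struct token *t, char *buf)
{
    size_t len = strlen(t->text) + 1;          /* include the NUL */
    memcpy(buf, &t->type, sizeof t->type);
    memcpy(buf + sizeof t->type, t->text, len);
    return sizeof t->type + len;
}

/* Rebuild a token from the flat bytes on the receiving side. */
struct token unflatten_token(const char *buf)
{
    struct token t;
    memcpy(&t.type, buf, sizeof t.type);
    t.text = buf + sizeof t.type;              /* points into the receive buffer */
    return t;
}
```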

(b) Simple control flow. The grouping of functions into processes was motivated by the desire for simple control flow. Commands are passed only to the right; data and errors only to the left. Process 3 must issue commands to various overlays in process 4; therefore, it was placed to the left of process 4. Naturally, the parser must precede process 3.

Previous process structures had a more complex interconnection of processes. This made synchronization and debugging much harder.

The structure of process 4 stemmed from a desire to overlay little-used code in a single process. The alternative would have been to create additional processes 5, 6, and 7 (and their associated pipes), which would be quiescent most of the time. This would have required added space in UNIX core tables for no real advantage.

The processes are all synchronized (i.e. each waits for an error return from the next process to the right before continuing to accept input from the process to the left), simplifying the flow of control. Moreover, in many instances the various processes must be synchronized. Future versions of INGRES may attempt to exploit parallelism where possible. The performance payoff of such parallelism is unknown at the present time.

(c) Isolation of the front end process. For reasons of protection the C program which replaces the terminal monitor as a front end must run with a user-id different from that of INGRES. Otherwise it could tamper directly with data managed by INGRES. Hence, it must be either overlayed into a process or run in its own process. The latter was chosen for efficiency and convenience.

(d) Rationale for two process structures. The interactive terminal monitor could have been written in EQUEL. Such a strategy would have avoided the existence of two process structures which differ only in the treatment of the data pipe. Since the terminal monitor was written prior to the existence of EQUEL, this option could not be followed. Rewriting the terminal monitor in EQUEL is not considered a high priority task given current resources. Moreover, an EQUEL monitor would be slightly slower because qualifying tuples would be returned to the calling program and then displayed rather than being displayed directly by process 3.

3  Data Structures and Access Methods

We begin this section with a discussion of the files that INGRES manipulates and their contents. Then we indicate the five possible storage structures (file formats) for relations. Finally we sketch the access methods language used to interface uniformly to the available formats.

3.1  The INGRES File Structure

Figure 3 indicates the subtree of the UNIX file system that INGRES manipulates. The root of this subtree is a directory made for the UNIX user “INGRES.” (When the INGRES system is initially installed such a user must be created. This user is known as the “superuser” because of the powers available to him. This subject is discussed further in [28].) This root has six descendant directories. The AUX directory has descendant files containing tables which control the spawning of processes (shown in Figures 1 and 2) and an authorization list of users who are allowed to create databases. Only the INGRES superuser may modify these files (by using the UNIX editor). BIN and SOURCE are directories indicating descendant files of respectively object and source code. TMP has descendants which are temporary files for the workspaces used by the interactive terminal monitor. DOC is the root of a subtree with system documentation and the reference manual. Last, there is a directory entry in DATADIR for each database that exists in INGRES. These directories contain the database files in a given database as descendants.

These database files are of four types:

(a) Administration file. This contains the user-id of the database administrator (DBA) and initialization information.

(b) Catalog (system) relations. These relations have predefined names and are created for every database. They are owned by the DBA and constitute the system catalogs. They may be queried by a knowledgeable user issuing RETRIEVE statements; however, they may be updated only by the INGRES utility commands (or directly by the INGRES superuser in an emergency). (When protection statements are implemented the DBA will be able to selectively restrict RETRIEVE access to these relations if he wishes.) The form and content of some of these relations will be discussed presently.

(c) DBA relations. These are relations owned by the DBA and are shared in that any user may access them. When protection is implemented the DBA can “authorize” shared use of these relations by inserting protection predicates (which will be in one of the system relations and may be unique for each user) and deauthorize use by removing such predicates. This mechanism is discussed in [28].

(d) Other relations. These are relations created by other users (by RETRIEVE INTO W or CREATE) and are not shared.

Three comments should be made at this time.

Figure 3  The INGRES subtree

(a) The DBA has the following powers not available to ordinary users: the ability to create shared relations and to specify access control for them; the ability to run PURGE; the ability to destroy any relations in his database (except the system catalogs).

This system allows “one-level sharing” in that only the DBA has these powers, and he cannot delegate any of them to others (as is possible in the file systems of most time-sharing systems). This strategy was implemented for three reasons: (1) The need for added generality was not perceived. Moreover, added generality would have created tedious problems (such as making revocation of access privileges nontrivial). (2) It seems appropriate to entrust to the DBA the duty (and power) to resolve the policy decision which must be made when space is exhausted and some relations must be destroyed or archived. This policy decision becomes much harder (or impossible) if a database is not in the control of one user. (3) Someone must be entrusted with the policy decision concerning which relations are physically stored and which are defined as “views.” This “database design” problem is best centralized in a single DBA.

(b) Except for the single administration file in each database, every file is treated as a relation. Storing system catalogs as relations has the following advantages: (1) Code is economized by sharing routines for accessing both catalog and data relations. (2) Since several storage structures are supported for accessing data relations quickly and flexibly under various interaction mixes, these same storage choices may be utilized to enhance access to catalog information. (3) The ability to execute QUEL statements to examine (and patch) system relations where necessary has greatly aided system debugging.

(c) Each relation is stored in a separate file, i.e. no attempt is made to “cluster” tuples from different relations which may be accessed together on the same or on a nearby page.

Note clearly that this clustering is analogous to the DBTG practice of declaring a record type to be accessed via a set type which associates records of that record type with records of a different record type. Current DBTG implementations usually attempt to physically cluster these associated records.

Note also that clustering tuples from one relation in a given file has obvious performance implications. The clustering techniques of this nature that INGRES supports are indicated in Section 3.3.

The decision not to cluster tuples from different relations is based on the following reasoning. (1) UNIX has a small (512-byte) page size. Hence it is expected that the number of tuples which can be grouped on the same page is small. Moreover, logically adjacent pages in a UNIX file are not necessarily physically adjacent. Hence clustering tuples on “nearby” pages has no meaning in UNIX; the next logical page in a file may be further away (in terms of disk arm motion) than a page in a different file. In keeping with the design decision of not modifying UNIX, these considerations were incorporated in the design decision not to support clustering. (2) The access methods would be more complicated if clustering were supported. (3) Clustering of tuples only makes sense if associated tuples can be linked together using “sets” [6], “links” [29], or some other scheme for identifying clusters. Incorporating these access paths into the decomposition scheme would have greatly increased its complexity.

It should be noted that the designers of System R have reached a different conclusion concerning clustering [2].

3.2  System Catalogs

We now turn to a discussion of the system catalogs. We discuss two relations in detail and indicate briefly the contents of the others.

The RELATION relation contains one tuple for every relation in the database (including all the system relations). The domains of this relation are:

relid
    the name of the relation.

owner
    the UNIX user-id of the relation owner; when appended to relid it produces a unique file name for storing the relation.

spec
    indicates one of five possible storage schemes or else a special code indicating a virtual relation (or “view”).

indexd
    flag set if secondary index exists for this relation. (This flag and the following two are present to improve performance by avoiding catalog lookups when possible during query modification and one variable query processing.)

protect flag
    set if this relation has protection predicates.

integ
    flag set if there are integrity constraints.

save
    scheduled lifetime of relation.

tuples
    number of tuples in relation (kept up to date by the routine “closer” discussed in the next section).

atts
    number of domains in relation.

width
    width (in bytes) of a tuple.

prim
    number of primary file pages for this relation.
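The domains above can be pictured as a C record. The structure below is an illustrative mirror of one RELATION-catalog tuple, not the actual INGRES record layout; the field widths and types are guesses.

```c
#include <stdio.h>
#include <stddef.h>

/* Illustrative C mirror of one tuple of the RELATION catalog.
 * Field names follow the text; widths and types are assumptions. */
struct relation_tuple {
    char relid[13];   /* relation name                                   */
    int  owner;       /* UNIX user-id of the owner                       */
    int  spec;        /* storage scheme, or code for a virtual relation  */
    int  indexd;      /* nonzero if a secondary index exists             */
    int  protect;     /* nonzero if protection predicates exist          */
    int  integ;       /* nonzero if integrity constraints exist          */
    long save;        /* scheduled lifetime                              */
    long tuples;      /* number of tuples in the relation                */
    int  atts;        /* number of domains                               */
    int  width;       /* tuple width in bytes                            */
    int  prim;        /* number of primary file pages                    */
};

/* The text notes that appending the owner's user-id to relid yields a
 * unique file name for storing the relation; that rule might look so: */
void relation_file_name(const struct relation_tuple *r, char *out, size_t cap)
{
    snprintf(out, cap, "%s%d", r->relid, r->owner);
}
```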

The ATTRIBUTE catalog contains information relating to individual domains of relations. Tuples of the ATTRIBUTE catalog contain the following items for each domain of every relation in the database:

relid
    name of relation in which attribute appears.

owner
    relation owner.

domain_name
    domain name.

domain_no
    domain number (position) in relation. In processing interactions INGRES uses this number to reference this domain.

offset
    offset in bytes from beginning of tuple to beginning of domain.

type
    data type of domain (integer, floating point, or character string).

length
    length (in bytes) of domain.

keyno
    if this domain is part of a key, then “keyno” indicates the ordering of this domain within the key.
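The offset domain is derivable from the lengths: each domain begins where the previous one ends. A minimal sketch of that rule (illustrative, not INGRES code):

```c
/* Compute the "offset" entries of the ATTRIBUTE catalog from the
 * domain lengths: domain i starts where domain i-1 ends. */
void fill_offsets(const int lengths[], int n, int offsets[])
{
    int off = 0;
    for (int i = 0; i < n; i++) {
        offsets[i] = off;       /* byte offset of domain i within the tuple */
        off += lengths[i];
    }
}
```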

These two catalogs together provide information about the structure and content of each relation in the database. No doubt items will continue to be added or deleted as the system undergoes further development. The first planned extensions are the minimum and maximum values assumed by domains. These will be used by a more sophisticated decomposition scheme being developed, which is discussed briefly in Section 5 and in detail in [30]. The representation of the catalogs as relations has allowed this restructuring to occur very easily.

Several other system relations exist which provide auxiliary information about relations. The INDEX catalog contains a tuple for every secondary index in the database. Since secondary indices are themselves relations, they are independently cataloged in the RELATION and ATTRIBUTE relations. However, the INDEX catalog provides the association between a primary relation and its secondary indices and records which domains of the primary relation are in the index.

PROTECTION 和 INTEGRITY 目录分别包含数据库中每个关系的保护和完整性谓词。这些谓词以部分处理的形式存储为字符串。(该机制存在于 INTEGRITY 中,并且将以与 PROTECTION 相同的方式实现。)对于每个虚拟关系,VIEW 目录将包含根据现有关系对视图进行部分处理的类似 Q UEL 的描述最后三个目录的使用在第 4 节中描述。​​给定关系的任何辅助信息的存在都由 RELATION 目录中的适当标志来表示。

The PROTECTION and INTEGRITY catalogs contain respectively the protection and integrity predicates for each relation in the database. These predicates are stored in a partially processed form as character strings. (This mechanism exists for INTEGRITY and will be implemented in the same way for PROTECTION.) The VIEW catalog will contain, for each virtual relation, a partially processed QUEL-like description of the view in terms of existing relations. The use of these last three catalogs is described in Section 4. The existence of any of this auxiliary information for a given relation is signaled by the appropriate flag(s) in the RELATION catalog.

另一组系统关系由图形子系统用于编目和处理地图的系统关系组成,这些关系(与其他所有内容一样)作为关系存储在数据库中。该主题已在[ 13 ]中单独讨论。

Another set of system relations consists of those used by the graphics subsystem to catalog and process maps, which (like everything else) are stored as relations in the database. This topic has been discussed separately in [13].

3.3  Storage Structures Available

We will now describe the five storage structures currently available in INGRES. Four of the schemes are keyed, i.e. the storage location of a tuple within the file is a function of the value of the tuple’s key domains. They are termed “hashed,” “ISAM,” “compressed hash,” and “compressed ISAM.” For all four structures the key may be any ordered collection of domains. These schemes allow rapid access to specific portions of a relation when key values are supplied. The remaining nonkeyed scheme (a “heap”) stores tuples in the file independently of their values and provides a low overhead storage structure, especially attractive in situations requiring a complete scan of the relation.

The nonkeyed storage structure in INGRES is a randomly ordered sequential file. Fixed length tuples are simply placed sequentially in the file in the order supplied. New tuples added to the relation are merely appended to the end of the file. The unique tuple identifier for each tuple is its byte-offset within the file. This mode is intended mainly for (a) very small relations, for which the overhead of other schemes is unwarranted; (b) transitional storage of data being moved into or out of the system by COPY; (c) certain temporary relations created as intermediate results during query processing.

In the remaining four schemes the key-value of a tuple determines the page of the file on which the tuple will be placed. The schemes share a common “page-structure” for managing tuples on file pages, as shown in Figure 4.

A tuple must fit entirely on a single page. Its unique tuple identifier (TID) consists of a page number (the ordering of its page in the UNIX file) plus a line number. The line number is an index into a line table, which grows upward from the bottom of the page, and whose entries contain pointers to the tuples on the page. In this way the physical arrangement of tuples on a page can be reorganized without affecting TIDs.
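The line-table indirection can be sketched in a few lines. This is an illustrative Python model, not INGRES code (which is C), and all names are hypothetical; it shows why a page can be physically reorganized while every previously issued TID remains valid.

```python
# Sketch of a page with a line table: TID = (page number, line number),
# and the line table maps line numbers to physical slots on the page.

class Page:
    def __init__(self, page_no):
        self.page_no = page_no
        self.tuples = []       # physical storage; may be rearranged
        self.line_table = []   # line number -> index into self.tuples

    def insert(self, tup):
        self.tuples.append(tup)
        self.line_table.append(len(self.tuples) - 1)
        line_no = len(self.line_table) - 1
        return (self.page_no, line_no)   # the TID handed out

    def get(self, tid):
        page_no, line_no = tid
        assert page_no == self.page_no
        return self.tuples[self.line_table[line_no]]

    def compact(self):
        # physically rearrange tuples, then fix up the line table so
        # every outstanding TID still resolves to the same tuple
        order = sorted(range(len(self.tuples)), key=lambda i: self.tuples[i])
        self.tuples = [self.tuples[i] for i in order]
        inverse = {old: new for new, old in enumerate(order)}
        self.line_table = [inverse[i] for i in self.line_table]

p = Page(0)
t1 = p.insert(("Smith", "toy"))
t2 = p.insert(("Br", "candy"))
before = p.get(t1)
p.compact()                  # physical order on the page changes...
assert p.get(t1) == before   # ...but the TID still works
```

The same indirection is what lets the compression schemes described below repack a page without invalidating tuple identifiers.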

Initially the file contains all its tuples on a number of primary pages. If the relation grows and these pages fill, overflow pages are allocated and chained by pointers to the primary pages with which they are associated. Within a chained group of pages no special ordering of tuples is maintained. Thus in a keyed access which locates a particular primary page, tuples matching the key may actually appear on any page in the chain.

As discussed in [16], two modes of key-to-address transformation are used—randomizing (or “hashing”) and order preserving. In a “hash” file tuples are distributed randomly throughout the primary pages of the file according to a hashing function on a key. This mode is well suited for situations in which access is to be conditioned on a specific key value.
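The randomizing transformation amounts to hashing the key to a primary page number. A minimal sketch, with `zlib.crc32` standing in for the unspecified INGRES hashing function and the page count chosen arbitrarily:

```python
# Sketch of key-to-address transformation by hashing: a key maps to one
# of N primary pages. zlib.crc32 is a stand-in hash, not the INGRES one.
import zlib

N_PRIMARY_PAGES = 8   # hypothetical file size in primary pages

def primary_page(key_bytes: bytes) -> int:
    return zlib.crc32(key_bytes) % N_PRIMARY_PAGES

# equality lookups touch a single page chain; adjacent keys land on
# unrelated pages, so a range predicate gains nothing from this mode
assert 0 <= primary_page(b"Smith") < N_PRIMARY_PAGES
assert primary_page(b"Smith") == primary_page(b"Smith")
```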

As an order preserving mode, a scheme similar to IBM’s ISAM [18] is used. The relation is sorted to produce the ordering on a particular key. A multilevel directory is created which records the high key on each primary page. The directory, which is static, resides on several pages following the primary pages within the file itself. A primary page and its overflow pages are not maintained in sort order. This decision is discussed in Section 4.2. The “ISAM-like” mode is useful in cases where the key value is likely to be specified as falling within a range of values, since a near ordering of the keys is preserved. The index compression scheme discussed in [16] is currently under implementation.
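A one-level version of such a directory is easy to sketch: a sorted table of the high key on each primary page, searched by bisection. This is an illustrative Python model with made-up keys, not the multilevel structure INGRES actually stores.

```python
# Sketch of an "ISAM-like" directory: the high key of each primary page,
# searched with bisect to find the page that could hold a given key.
import bisect

high_keys = ["Evans", "Lopez", "Smith", "Zeyer"]   # high key per primary page

def page_for(key: str) -> int:
    i = bisect.bisect_left(high_keys, key)
    return min(i, len(high_keys) - 1)   # keys past the last high key go to the last page

assert page_for("Brown") == 0   # "Brown" <= "Evans"
assert page_for("Jones") == 1
assert page_for("Smith") == 2
# order preservation is what makes range restriction work: the range
# "Jones" <= key <= "Smith" maps to the contiguous pages 1..2
assert page_for("Jones") <= page_for("Smith")
```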

Figure 4  Page layout for keyed storage structures

In the above-mentioned keyed modes, fixed length tuples are stored. In addition, both schemes can be used in conjunction with data compression techniques [14] in cases where increased storage utilization outweighs the added cost of encoding and decoding data during access. These modes are known as “compressed hash” and “compressed ISAM.”

The current compression scheme suppresses blanks and portions of a tuple which match the preceding tuple. This compression is applied to each page independently. Other schemes are being experimented with. Compression appears to be useful in storing variable length domains (which must be declared at their maximum length). Padding is then removed during compression by the access method. Compression may also be useful when storing secondary indices.

3.4  Access Methods Interface

The Access Methods Interface (AMI) handles all actual accessing of data from relations. The AMI language is implemented as a set of functions whose calling conventions are indicated below. A separate copy of these functions is loaded with each of processes 2, 3, and 4.

Each access method must do two things to support the following calls. First, it must provide some linear ordering of the tuples in a relation so that the concept of “next tuple” is well defined. Second, it must assign to each tuple a unique tuple-id (TID).

The nine implemented calls are as follows:

(a)  OPENR(descriptor, mode, relation_name)

Before a relation may be accessed it must be “opened.” This function opens the UNIX file for the relation and fills in a “descriptor” with information about the relation from the RELATION and ATTRIBUTE catalogs. The descriptor (storage for which must be declared in the calling routine) is used in subsequent calls on AMI routines as an input parameter to indicate which relation is involved. Consequently, the AMI data accessing routines need not themselves check the system catalogs for the description of a relation. “Mode” specifies whether the relation is being opened for update or for retrieval only.

(b)  GET(descriptor, tid, limit_tid, tuple, next_flag)

This function retrieves into “tuple,” a single tuple from the relation indicated by “descriptor.” “Tid” and “limit_tid” are tuple identifiers. There are two modes of retrieval, “scan” and “direct.” In “scan” mode GET is intended to be called successively to retrieve all tuples within a range of tuple-ids. An initial value of “tid” sets the low end of the range desired and “limit_tid” sets the high end. Each time GET is called with “next_flag” = TRUE, the tuple following “tid” is retrieved and its tuple-id is placed into “tid” in readiness for the next call. Reaching “limit_tid” is indicated by a special return code. The initial settings of “tid” and “limit_tid” are done by calling the FIND function. In “direct” mode (“next_flag” = FALSE), GET retrieves the tuple with tuple-id = “tid.”

(c)  FIND(descriptor, key, tid, key_type)

When called with a negative “key_type,” FIND returns in “tid” the lowest tuple-id on the lowest page which could possibly contain tuples matching the key supplied. Analogously, the highest tuple-id is returned when “key_type” is positive. The objective is to restrict the scan of a relation by eliminating from consideration tuples which are known from their placement not to satisfy a given qualification.

“Key_type” also indicates (through its absolute value) whether the key, if supplied, is an EXACTKEY or a RANGEKEY. Different criteria for matching are applied in each case. An EXACTKEY matches only those tuples containing exactly the value of the key supplied. A RANGEKEY represents the low (or high) end of a range of possible key values and thus matches any tuple with a key value greater than or equal to (or less than or equal to) the key supplied. Note that only with an order preserving storage structure can a RANGEKEY be used to successfully restrict a scan.

In cases where the storage structure of the relation is incompatible with the “key_type,” the “tid” returned will be as if no key were supplied (that is, the lowest or highest tuple in the relation). Calls to FIND invariably occur in pairs, to obtain the two tuple-ids which establish the low and high ends of the scan done in subsequent calls to GET.
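The paired-FIND scan idiom can be sketched with a toy in-memory relation. The Python shapes below only mimic the C calling conventions described above (names mirror the AMI calls but everything else is hypothetical); note that tuples inside the bounded range may still need the qualification re-checked, since FIND only bounds the scan.

```python
# Sketch of the AMI scan idiom: two FIND calls bound the scan, then GET
# in "scan" mode walks tuple-ids until the high bound is passed.

EXACTKEY = 1   # hypothetical key_type magnitude

class Relation:
    def __init__(self, tuples):
        self.tuples = tuples            # tid is simply an index here
    def find(self, key, key_type):
        tids = [i for i, t in enumerate(self.tuples) if t[0] == key]
        if not tids:
            return 0 if key_type < 0 else -1   # empty range: lo > hi
        return min(tids) if key_type < 0 else max(tids)
    def get(self, tid, limit_tid):
        # "scan" mode: return the tuple at tid and advance; signal the
        # end with a special return code once past limit_tid
        if tid > limit_tid:
            return True, tid, None
        return False, tid + 1, self.tuples[tid]

def scan(rel, key):
    lo = rel.find(key, key_type=-EXACTKEY)   # lowest possibly-matching tid
    hi = rel.find(key, key_type=+EXACTKEY)   # highest possibly-matching tid
    tid = lo
    while True:
        done, tid, tup = rel.get(tid, hi)
        if done:
            break
        yield tup

emp = Relation([("candy", "Brown"), ("toy", "Smith"), ("toy", "Jones")])
toys = [t for t in scan(emp, "toy") if t[0] == "toy"]   # re-check qualification
assert toys == [("toy", "Smith"), ("toy", "Jones")]
```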

Two functions are available for determining the access characteristics of the storage structure of a primary data relation or secondary index, respectively.

(d)  PARAMD(descriptor, access_characteristics_structure)

(e)  PARAMI(index-descriptor, access_characteristics_structure)

The “access-characteristics-structure” is filled in with information regarding the type of key which may be utilized to restrict the scan of a given relation: it indicates whether exact key values or ranges of key values can be used, and whether a partially specified key may be used. This determines the “key_type” used in a subsequent call to FIND. The ordering of domains in the key is also indicated. These two functions allow the access optimization routines to be coded independently of the specific storage structures currently implemented.

Other AMI functions provide a facility for updating relations.

(f)  INSERT(descriptor, tuple)

The tuple is added to the relation in its “proper” place according to its key value and the storage mode of the relation.

(g)  REPLACE(descriptor, tid, new_tuple)

(h)  DELETE(descriptor, tid)

The tuple indicated by “tid” is either replaced by new values or deleted from the relation altogether. The tuple-id of the affected tuple will have been obtained by a previous GET.

Finally, when all access to a relation is complete it must be closed:

(i)  CLOSER(descriptor)

This closes the relation’s UNIX file and rewrites the information in the descriptor back into the system catalogs if there has been any change.

3.5  Addition of New Access Methods

One of the goals of the AMI design was to insulate higher level software from the actual functioning of the access methods, thereby making it easier to add different ones. It is anticipated that users with special requirements will take advantage of this feature.

In order to add a new access method, one need only extend the AMI routines to handle the new case. If the new method uses the same page layout and TID scheme, only FIND, PARAMI, and PARAMD need to be extended. Otherwise new procedures to perform the mapping of TIDs to physical file locations must be supplied for use by GET, INSERT, REPLACE, and DELETE.

4  The Structure of Process 2

Process 2 contains four main components:

(a)  a lexical analyzer;

(b)  a parser (written in YACC [19]);

(c)  concurrency control routines;

(d)  query modification routines to support protection, views, and integrity control (at present only partially implemented).

Since (a) and (b) are designed and implemented along fairly standard lines, only (c) and (d) will be discussed in detail. The output of the parsing process is a tree structured representation of the input query used as the internal form in subsequent processing. Furthermore, the qualification portion of the query has been converted to an equivalent Boolean expression in conjunctive normal form. In this form the query tree is then ready to undergo what has been termed “query modification.”

4.1  Query Modification

Query modification includes adding integrity and protection predicates to the original query and changing references to virtual relations into references to the appropriate physical relations. At the present time only a simple integrity scheme has been implemented.

In [27] algorithms of several levels of complexity are presented for performing integrity control on updates. In the present system only the simplest case, involving single-variable, aggregate free integrity assertions, has been implemented, as described in detail in [23].

Briefly, integrity assertions are entered in the form of QUEL qualification clauses to be applied to interactions updating the relation over which the variable in the assertion ranges. A parse tree is created for the qualification and a representation of this tree is stored in the INTEGRITY catalog together with an indication of the relation and the specific domains involved. At query modification time, updates are checked for any possible integrity assertions on the affected domains. Relevant assertions are retrieved, rebuilt into tree form, and grafted onto the update tree so as to AND the assertions with the existing qualification of the interaction.

Algorithms for the support of views are also given in [27]. Basically a view is a virtual relation defined in terms of relations which physically exist. Only the view definition will be stored, and it will be indicated to INGRES by a DEFINE command. This command will have a syntax identical to that of a RETRIEVE statement. Thus legal views will be those relations which it is possible to materialize by a RETRIEVE statement. They will be allowed in INGRES to support EQUEL programs written for obsolete versions of the database and for user convenience.

Protection will be handled according to the algorithm described in [25]. Like integrity control, this algorithm involves adding qualifications to the user’s interaction. The details of the implementation (which is in progress) are given in [28], which also includes a discussion of the mechanisms being implemented to physically protect INGRES files from tampering in any way other than by executing the INGRES object code. Last, [28] distinguishes the INGRES protection scheme from the one based on views in [5] and indicates the rationale behind its use.

In the remainder of this section we give an example of query modification at work.

Suppose at a previous point in time all employees in the EMPLOYEE relation were under 30 and had no manager recorded. If an EQUEL program had been written for this previous version of EMPLOYEE which retrieved ages of employees coded into 5 bits, it would now fail for employees over 31.

If one wishes to use the above program without modification, then the following view must be used:

RANGE OF E IS EMPLOYEE

DEFINE OLDEMP (E.NAME, E.DEPT, E.SALARY, E.AGE)

WHERE E.AGE < 30

Suppose that all employees in the EMPLOYEE relation must make more than $8000. This can be expressed by the integrity constraint:

RANGE OF E IS EMPLOYEE

INTEGRITY CONSTRAINT IS E.SALARY > 8000

Last, suppose each person is only authorized to alter salaries of employees whom he manages. This is expressed as follows:

RANGE OF E IS EMPLOYEE

PROTECT EMPLOYEE FOR ALL (E.SALARY; E.NAME)

WHERE E.MANAGER = *

The * is a surrogate for the logon name of the current UNIX user of INGRES. The semicolon separates updatable from nonupdatable (but visible) domains.

Suppose Smith through an EQUEL program or from the terminal monitor issues the following interaction:

RANGE OF L IS OLDEMP

REPLACE L(SALARY = .9*L.SALARY)

WHERE L.NAME = “Brown”

This is an update on a view. Hence the view algorithm in [27] will first be applied to yield:

RANGE OF E IS EMPLOYEE

REPLACE E(SALARY = .9*E.SALARY)

WHERE E.NAME = “Brown”

AND E.AGE < 30

Note Brown is only in OLDEMP if he is under 30. Now the integrity algorithm in [27] must be applied to ensure that Brown’s salary is not being cut to as little as $8000. This involves modifying the interaction to:

RANGE OF E IS EMPLOYEE

REPLACE E(SALARY = .9*E.SALARY)

WHERE E.NAME = “Brown”

 AND E.AGE < 30

 AND .9*E.SALARY > $8000

Since .9*E.SALARY will be Brown’s salary after the update, the added qualification ensures this will be more than $8000.

Last, the protection algorithm of [28] is applied to yield:

RANGE OF E IS EMPLOYEE

REPLACE E(SALARY = .9*E.SALARY)

WHERE E.NAME = “Brown”

 AND E.AGE < 30

 AND .9*E.SALARY > $8000

 AND E.MANAGER = “Smith”

Notice that in all three cases more qualification is ANDed onto the user’s interaction. The view algorithm must in addition change tuple variables.

In all cases the qualification is obtained from (or is an easy modification of) predicates stored in the VIEW, INTEGRITY, and PROTECTION relations. The tree representation of the interaction is simply modified to AND these qualifications (which are all stored in parsed form).

It should be clearly noted that only one-variable, aggregate free integrity assertions are currently supported. Moreover, even this feature is not in the released version of INGRES. The code for both concurrency control and integrity control will not fit into process 2 without exceeding 64K words. The decision was made to release a system with concurrency control.

The INGRES designers are currently adding a fifth process (process 2.5) to hold concurrency and query modification routines. On PDP 11/45s and 11/70s that have a 128K address space this extra process will not be required.

4.2  Concurrency Control

In any multiuser system provisions must be included to ensure that multiple concurrent updates are executed in a manner such that some level of data integrity can be guaranteed. The following two updates illustrate the problem.

RANGE OF E IS EMPLOYEE

U1

REPLACE E(DEPT = “toy”)

 WHERE E.DEPT = “candy”

RANGE OF F IS EMPLOYEE

U2

REPLACE F(DEPT = “candy”)

WHERE F.DEPT = “toy”

If U1 and U2 are executed concurrently with no controls, some employees may end up in each department and the particular result may not be repeatable if the database is backed up and the interactions reexecuted.

The control which must be provided is to guarantee that some database operation is “atomic” (occurs in such a fashion that it appears instantaneous and before or after any other database operation). This atomic unit will be called a “transaction.”

In INGRES there are five basic choices available for defining a transaction:

(a)  something smaller than one INGRES command;

(b)  one INGRES command;

(c)  a collection of INGRES commands with no intervening C code;

(d)  a collection of INGRES commands with C code but no system calls;

(e)  an arbitrary EQUEL program.

If option (a) is chosen, INGRES could not guarantee that two concurrently executing update commands would give the same result as if they were executed sequentially (in either order) in one collection of INGRES processes. In fact, the outcome could fail to be repeatable, as noted in the example above. This situation is clearly undesirable.

Option (e) is, in the opinion of the INGRES designers, impossible to support. The following transaction could be declared in an EQUEL program.

BEGIN TRANSACTION

FIRST QUEL UPDATE

SYSTEM CALLS TO CREATE AND DESTROY FILES

SYSTEM CALLS TO FORK A SECOND COLLECTION OF INGRES PROCESSES

  TO WHICH COMMANDS ARE PASSED

SYSTEM CALLS TO READ FROM A TERMINAL

SYSTEM CALLS TO READ FROM A TAPE

SECOND QUEL UPDATE (whose form depends on previous two system calls)

END TRANSACTION

Suppose T1 is the above transaction and runs concurrently with a transaction T2 involving commands of the same form. The second update of each transaction may well conflict with the first update of the other. Note that there is no way to tell a priori that T1 and T2 conflict, since the form of the second update is not known in advance. Hence a deadlock situation can arise which can only be resolved by aborting one transaction (an undesirable policy in the eyes of the INGRES designers) or attempting to back out one transaction. The overhead of backing out through the intermediate system calls appears prohibitive (if it is possible at all).

Restricting a transaction to have no system calls (and hence no I/O) cripples the power of a transaction in order to make deadlock resolution possible. This was judged undesirable.

For example, the following transaction requires such system calls:

BEGIN TRANSACTION

QUEL RETRIEVE to find all flights on a particular day from San Francisco to Los Angeles with space available.

Display flights and times to user.

Wait for user to indicate desired flight.

QUEL REPLACE to reserve a seat on the flight of the user’s choice.

END TRANSACTION

If the above set of commands is not a transaction, then space on a flight may not be available when the REPLACE is executed even though it was when the RETRIEVE occurred.

Since it appears impossible to support multi-QUEL statement transactions (except in a crippled form), the INGRES designers have chosen Option (b), one QUEL statement, as a transaction.

Option (c) can be handled by a straightforward extension of the algorithms to follow and will be implemented if there is sufficient user demand for it. This option can support “triggers” [2] and may prove useful.

Supporting Option (d) would considerably increase system complexity for what is perceived to be a small generalization. Moreover, it would be difficult to enforce in the EQUEL translator unless the translator parsed the entire C language.

The implementation of (b) or (c) can be achieved by physical locks on data items, pages, tuples, domains, relations, etc. [12] or by predicate locks [26]. The current implementation is by relatively crude physical locks (on domains of a relation) and avoids deadlock by not allowing an interaction to proceed to process 3 until it can lock all required resources. Because of a problem with the current design of the REPLACE access method call, all domains of a relation must currently be locked (i.e. a whole relation is locked) to perform an update. This situation will soon be rectified.

The choice of avoiding deadlock rather than detecting and resolving it is made primarily for implementation simplicity.

The choice of a crude locking unit reflects our environment where core storage for a large lock table is not available. Our current implementation uses a LOCK relation into which a tuple for each lock requested is inserted. This entire relation is physically locked and then interrogated for conflicting locks. If none exist, all needed locks are inserted. If a conflict exists, the concurrency processor “sleeps” for a fixed interval and then tries again. The necessity to lock the entire relation and to sleep for a fixed interval results from the absence of semaphores (or an equivalent mechanism) in UNIX. Because concurrency control can have high overhead as currently implemented, it can be turned off.

The INGRES designers are considering writing a device driver (a clean extension to UNIX routinely written for new devices) to alleviate the lack of semaphores. This driver would simply maintain core tables to implement desired synchronization and physical locking in UNIX.
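The all-or-nothing acquisition with sleep-and-retry can be sketched as follows. This is an illustrative Python model of the protocol, not the LOCK-relation code; the retry interval and lock granularity are hypothetical, and the conflict check is modeled as atomic because INGRES physically locks the whole LOCK relation while interrogating it.

```python
# Sketch of deadlock avoidance by all-at-once lock acquisition: either
# every lock an interaction needs is granted, or none are, and the
# requester sleeps for a fixed interval before retrying.
import time

lock_table = set()          # granted (relation, domain) locks

def try_lock_all(wanted):
    # modeled as atomic (the whole LOCK relation is locked in INGRES)
    if lock_table & wanted:
        return False        # conflict: grant nothing
    lock_table.update(wanted)
    return True

def acquire(wanted, retry_interval=0.01, max_tries=100):
    for _ in range(max_tries):
        if try_lock_all(wanted):
            return True
        time.sleep(retry_interval)   # fixed-interval sleep, then retry
    return False

def release(held):
    lock_table.difference_update(held)

w = {("EMPLOYEE", "DEPT"), ("EMPLOYEE", "SALARY")}
assert acquire(w)
assert not try_lock_all({("EMPLOYEE", "DEPT")})   # conflicting request waits
release(w)
assert try_lock_all({("EMPLOYEE", "DEPT")})
```

Because an interaction never holds some locks while waiting for others, the circular-wait condition for deadlock cannot arise, at the cost of retry latency under contention.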

The locks are held by the concurrency processor until a termination message is received on pipe E. Only then does it delete its locks.

In the future we plan to experimentally implement a crude (and thereby low CPU overhead) version of the predicate locking scheme described in [26]. Such an approach may provide considerable concurrency at an acceptable overhead in lock table space and CPU time, although such a statement is highly speculative.

To conclude this section, we briefly indicate the reasoning behind not sorting a page and its overflow pages in the “ISAM-like” access method. This topic is also discussed in [17].

The proposed device driver for locking in UNIX must at least ensure that read-modify-write of a single UNIX page is an atomic operation. Otherwise, INGRES would still be required to lock the whole LOCK relation to insert locks. Moreover, any proposed predicate locking scheme could not function without such an atomic operation. If the lock unit is a UNIX page, then INGRES can insert and delete a tuple from a relation by holding only one lock at a time if a primary page and its overflow page are unordered. However, maintenance of the sort order of these pages may require the access method to lock more than one page when it inserts a tuple. Clearly deadlock may be possible given concurrent updates, and the size of the lock table in the device driver is not predictable. To avoid both problems these pages remain unsorted.

5  Process 3

As noted in Section 2, this process performs the following two functions, which will be discussed in turn:

(a) Decomposition of queries involving more than one variable into sequences of one-variable queries. Partial results are accumulated until the entire query is evaluated. This program is called DECOMP. It also turns any updates into the appropriate queries to isolate qualifying tuples and spools modifications into a special file for deferred update.

(b) Processing of single-variable queries. The program is called the one-variable query processor (OVQP).

5.1  DECOMP

Because INGRES allows interactions which are defined on the crossproduct of perhaps several relations, efficient execution of this step is of crucial importance in searching as small a portion of the appropriate crossproduct space as possible. DECOMP uses three techniques in processing interactions. We describe each technique, and then give the actual algorithm implemented followed by an example which illustrates all features. Finally we indicate the role of a more sophisticated decomposition scheme under design.

(a) Tuple substitution. The basic technique used by DECOMP to reduce a query to fewer variables is tuple substitution. One variable (out of possibly many) in the query is selected for substitution. The AMI language is used to scan the relation associated with the variable one tuple at a time. For each tuple the values of domains in that relation are substituted into the query. In the resulting modified query, all previous references to the substituted variable have now been replaced by values (constants) and the query has thus been reduced to one less variable. Decomposition is repeated (recursively) on the modified query until only one variable remains, at which point the OVQP is called to continue processing.
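
The substitution-and-recursion loop can be sketched in a few lines, assuming relations are lists of dicts and the qualification is an ordinary predicate over a binding of tuple variables to tuples; the names `decompose` and `ovqp` are illustrative, not the INGRES routines themselves.

```python
def ovqp(rel, pred, binding, var, target):
    # one-variable query processor: scan a single relation with all other
    # variables already bound to constants
    out = []
    for t in rel:
        binding[var] = t
        if pred(binding):
            out.append(target(binding))
    del binding[var]
    return out

def decompose(ranges, pred, target, binding=None):
    """ranges: dict mapping tuple-variable name -> relation (list of dicts)."""
    binding = {} if binding is None else binding
    remaining = [v for v in ranges if v not in binding]
    if len(remaining) == 1:                       # one variable left: call OVQP
        return ovqp(ranges[remaining[0]], pred, binding, remaining[0], target)
    # substitute for the variable ranging over the smallest relation
    var = min(remaining, key=lambda v: len(ranges[v]))
    results = []
    for t in ranges[var]:                         # scan one tuple at a time
        binding[var] = t                          # substitute its values
        results.extend(decompose(ranges, pred, target, binding))
    del binding[var]
    return results
```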

(b) One-variable detachment. If the qualification Q of the query is of the form

  Q1[V1] AND Q2[V1, V2, …, Vn]

for some tuple variable V1, the following two steps can be executed:

1.  Issue the query

  RETRIEVE INTO W (TL[V1])

  WHERE Q1[V1]

Here TL[V1] are those domains required in the remainder of the query. Note that this is a one-variable query and may be passed directly to OVQP.

2.  Replace R1, the relation over which V1 ranges, by W in the range declaration and delete Q1[V1] from Q.

The query formed in step 1 is called a “one-variable, detachable subquery,” and the technique for forming and executing it is called “one-variable detachment” (OVD). This step has the effect of reducing the size of the relation over which V1 ranges by restriction and projection. Hence it may reduce the complexity of the processing to follow.
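
One-variable detachment is restriction followed by projection into a temporary relation W. A minimal sketch, assuming relations are lists of dicts (the function name and encoding are illustrative, not the INGRES code):

```python
def detach(relation, one_var_pred, needed_domains):
    """Run the detachable one-variable subquery: restrict by the one-variable
    qualification, project the domains needed later, drop duplicates."""
    seen, w = set(), []
    for t in relation:
        if one_var_pred(t):                                  # restriction
            proj = tuple((d, t[d]) for d in needed_domains)  # projection
            if proj not in seen:                             # duplicate removal
                seen.add(proj)
                w.append(dict(proj))
    return w
```

The temporary relation W then replaces the original range of V1, typically with a much smaller cardinality.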

Moreover, the opportunity exists in the process of creating new relations through OVD, to choose storage structures, and particularly keys, which will prove helpful in further processing.

(c) Reformatting. When a tuple variable is selected for substitution, a large number of queries, each with one less variable, will be executed. If (b) is a possible operation after the substitution for some remaining variable V1, then the relation over which V1 ranges, R1, can be reformatted to have domains used in Q1(V1) as a key. This will expedite (b) each time it is executed during tuple substitution.

We can now state the complete decomposition algorithm. After doing so, we illustrate all steps with an example.

Step 1. If the number of variables in the query is 0 or 1, call OVQP and then return; else go on to step 2.

Step 2. Find all variables, {V1, …, Vn}, for which the query contains a one-variable clause.

Perform OVD to create new ranges for each of these variables. The new relation for each variable Vi is stored as a hash file with key Ki chosen as follows:

2.1. For each j select from the remaining multivariable clauses in the query the collection, Cij, which have the form Vi.di = Vj.dj, where di and dj are domains of Vi and Vj.

2.2. Form the key Ki as the concatenation of domains di1, di2, … of Vi appearing in clauses in Cij.

2.3. If more than one j exists for which Cij is nonempty, one Cij is chosen arbitrarily for forming the key. If Cij is empty for all j, the relation is stored as an unsorted table.

Step 3. Choose the variable Vs with the smallest number of tuples as the next one for which to perform tuple substitution.

Step 4. For each tuple variable Vj for which Cjs is nonnull, reformat if necessary the storage structure of the relation Rj over which it ranges so that the key of Rj is the concatenation of domains dj1, … appearing in Cjs. This ensures that when the clauses in Cjs become one-variable after substituting for Vs, subsequent calls to OVQP to restrict further the range of Vj will be done as efficiently as possible.

Step 5. Iterate the following steps over all tuples in the range of the variable selected in step 3 and then return:

5.1. Substitute values from tuple into query.

 5.2. Invoke decomposition algorithm recursively on a copy of resulting query which now has been reduced by one variable.

 5.3. Merge the results from 5.2 with those of previous iterations.

We use the following query to illustrate the algorithm:

[image: the example QUEL query]

This request is for employees over 40 on the first floor who earn more than their manager.

LEVEL 1.

Step 1. Query is not one variable.

Step 2. Issue the two queries:

[image: queries (1) and (2)]

T1 is stored hashed on DEPT; however, the algorithm must choose arbitrarily between hashing T2 on MANAGER or DEPT. Suppose it chooses MANAGER. The original query now becomes:

RANGE OF D IS T1

RANGE OF E IS T2

RANGE OF M IS EMPLOYEE

RETRIEVE (E.NAME)

[image: the remaining qualification]

Step 3. Suppose T1 has smallest cardinality. Hence D is chosen for substitution.

Step 4. Reformat T2 to be hashed on DEPT; the guess chosen in step 2 above was a poor one.

Step 5. Iterate for each tuple in T1 and then quit:

5.1 Substitute value for D.DEPT, yielding

[image: the reduced query]

5.2. Start at step 1 with the above query as input (Level 2 below).

5.3. Cumulatively merge results as they are obtained.

LEVEL 2.

Step 1. Query is not one variable.

Step 2. Issue the query

[image: query (3)]

T3 is constructed hashed on MANAGER. T2 in step 4 in Level 1 above is reformatted so that this query (which will be issued once for each tuple in T1) will be done efficiently by OVQP. Hopefully the cost of reformatting is small compared to the savings at this step. What remains is

[image: the remaining query]

Step 3. T3 has fewer tuples than EMPLOYEE; therefore choose T3.

Step 4. [unnecessary]

Step 5. Iterate for each tuple in T3 and then return to previous level:

5.1. Substitute values for E.NAME, E.SALARY, and E.MANAGER, yielding

[image: query (4)]

5.2. Start at step 1 with this query as input (Level 3 below).

5.3. Cumulatively merge results as obtained.

LEVEL 3.

Step 1. Query has one variable; invoke OVQP and then return to previous level.

The algorithm thus decomposes the original query into the four prototype, one-variable queries labeled (1)–(4), some of which are executed repetitively with different constant values and with results merged appropriately. Queries (1) and (2) are executed once, query (3) once for each tuple in T1, and query (4) the number of times equal to the number of tuples in T1 times the number of tuples in T3.

The following comments on the algorithm are appropriate.

(a) OVD is almost always assured of speeding processing. Not only is it possible to choose the storage structure of a temporary relation wisely, but also the cardinality of this relation may be much less than the one it replaces as the range for a tuple variable. It only fails if little or no reduction takes place and reformatting is unproductive.

It should be noted that a temporary relation is created rather than a list of qualifying tuple-id’s. The basic tradeoff is that OVD must copy qualifying tuples but can remove duplicates created during the projection. Storing tuple-id’s avoids the copy operation at the expense of reaccessing qualifying tuples and retaining duplicates. It is clear that cases exist where each strategy is superior. The INGRES designers have chosen OVD because it does not appear to offer worse performance than the alternative, allows a more accurate choice of the variable with the smallest range in step 3 of the algorithm above, and results in cleaner code.

(b) Tuple substitution is done when necessary on the variable associated with the smallest number of tuples. This has the effect of reducing the number of eventual calls on OVQP.

(c) Reformatting is done (if necessary) with the knowledge that it will usually replace a collection of complete sequential scans of a relation by a collection of limited scans. This almost always reduces processing time.

(d) It is believed that this algorithm efficiently handles a large class of interactions. Moreover, the algorithm does not require excessive CPU overhead to perform. There are, however, cases where a more elaborate algorithm is indicated. The following comment applies to such cases.

(e) Suppose that we have two or more strategies ST0, ST1,…, STn, each one being better than the previous one but also requiring a greater overhead. Suppose further that we begin an interaction on ST0 and run it for an amount of time equal to a fraction of the estimated overhead of ST1. At the end of that time, by simply counting the number of tuples of the first substitution variable which have already been processed, we can get an estimate for the total processing time using ST0. If this is significantly greater than the overhead of ST1, then we switch to ST1. Otherwise we stay and complete processing the interaction using ST0. Obviously, the procedure can be repeated on ST1 to call ST2 if necessary, and so forth.
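
The proposed switch can be sketched as a simple cost comparison after a sampling run; the timing model and parameter names here are assumptions for illustration, not measurements from INGRES.

```python
def choose_strategy(tuples_total, tuples_done, elapsed, overhead_st1, rate_st1):
    """After running ST0 for `elapsed` seconds and completing `tuples_done` of
    `tuples_total` substitution tuples, decide whether to switch to ST1, whose
    fixed overhead is `overhead_st1` and whose throughput is `rate_st1`
    tuples/second."""
    est_st0 = elapsed * tuples_total / tuples_done    # projected total ST0 time
    est_st1 = overhead_st1 + tuples_total / rate_st1  # ST1 overhead plus its run
    return "ST1" if est_st0 > est_st1 else "ST0"
```

The same comparison can then be repeated against ST2 while ST1 runs, and so forth.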

The algorithm detailed in this section may be thought of as ST0. A more sophisticated algorithm is currently under development [30].

5.2  One-Variable Query Processor (OVQP)

This module is concerned solely with the efficient accessing of tuples from a single relation given a particular one-variable query. The initial portion of this program, known as STRATEGY, determines what key (if any) may be used profitably to access the relation, what value(s) of that key will be used in calls to the AMI routine FIND, and whether access may be accomplished directly through the AMI to the storage structure of the primary relation itself or if a secondary index on the relation should be used. If access is to be through a secondary index, then STRATEGY must choose which one of possibly many indices to use.

Tuples are then retrieved according to the access strategy selected and are processed by the SCAN portion of OVQP. These routines evaluate each tuple against the qualification part of the query, create target list values for qualifying tuples, and dispose of the target list appropriately.

Since SCAN is relatively straightforward, we discuss only the policy decisions made in STRATEGY.

First STRATEGY examines the qualification for clauses which specify the value of a domain, i.e. clauses of the form

V.domain op constant

or

constant op V.domain

where “op” is one of the set {=, <, >, ≤, ≥}. Such clauses are termed “simple” clauses and are organized into a list. The constants in simple clauses will determine the key values input to FIND to limit the ensuing scan.
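
Recognizing simple clauses can be sketched with a pattern match; the string syntax here is a simplified stand-in for QUEL's internal representation, and the operator is flipped when the constant appears on the left so that the key value always reads as a bound on the domain.

```python
import re

# one alternative per orientation: V.domain op constant, or constant op V.domain
SIMPLE = re.compile(
    r"^\s*(?:(\w+)\.(\w+)\s*(<=|>=|=|<|>)\s*(-?\d+)"
    r"|(-?\d+)\s*(<=|>=|=|<|>)\s*(\w+)\.(\w+))\s*$")

FLIP = {"<": ">", ">": "<", "<=": ">=", ">=": "<=", "=": "="}

def simple_clause(clause):
    """Return (var, domain, op, constant) if the clause is simple, else None."""
    m = SIMPLE.match(clause)
    if not m:
        return None                      # nonsimple clauses are ignored
    if m.group(1):                       # V.domain op constant
        var, dom, op, const = m.group(1), m.group(2), m.group(3), m.group(4)
    else:                                # constant op V.domain: flip the operator
        const, op, var, dom = m.group(5), m.group(6), m.group(7), m.group(8)
        op = FLIP[op]
    return var, dom, op, int(const)
```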

Obviously a nonsimple clause may be equivalent to a simple one. For example, E.SALARY/2 = 10000 is equivalent to E.SALARY = 20000. However, recognizing and converting such clauses requires a general algebraic symbol manipulator. This issue has been avoided by ignoring all nonsimple clauses.

STRATEGY must select one of two accessing strategies: (a) issuing two AMI FIND commands on the primary relation followed by a sequential scan of the relation (using GET in “scan” mode) between the limits set, or (b) issuing two AMI FIND commands on some index relation followed by a sequential scan of the index between the limits set. For each tuple retrieved the “pointer” domain is obtained; this is simply the tuple-id of a tuple in the primary relation. This tuple is fetched (using GET in “direct” mode) and processed.

To make the choice, the access possibilities available must be determined. Keying information about the primary relation is obtained using the AMI function PARAMD. Names of indices are obtained from the INDEX catalog and keying information about indices is obtained with the function PARAMI.

Further, a compatibility between the available access possibilities and the specification of key values by simple clauses must be established. A hashed relation requires that a simple clause specify equality as the operator in order to be useful; for combined (multidomain) keys, all domains must be specified. ISAM structures, on the other hand, allow range specifications; additionally, a combined ISAM key requires only that the most significant domains be specified.

STRATEGY checks for such a compatibility according to the following priority order of access possibilities: (1) hashed primary relation, (2) hashed index, (3) ISAM primary relation, (4) ISAM index. The rationale for this ordering is related to the expected number of page accesses required to retrieve a tuple from the source relation in each case. In the following analysis the effect of overflow pages is ignored (on the assumption that the four access possibilities would be equally affected).

In case (1) the key value provided locates a desired source tuple in one access via calculation involving a hashing function. In case (2) the key value similarly locates an appropriate index relation tuple in one access, but an additional access is required to retrieve the proper primary relation tuple. For an ISAM-structured scheme a directory must be examined. This lookup itself incurs at least one access but possibly more if the directory is multilevel. Then the tuple itself must be accessed. Thus case (3) requires at least two (but possibly more) total accesses. In case (4) the use of an index necessitates yet another access in the primary relation, making the total at least three.
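
The compatibility rules and the priority order can be sketched together; the path names and the `simple_ops` encoding (domain → set of operators from simple clauses) are assumptions for illustration.

```python
PRIORITY = ["hash primary", "hash index", "isam primary", "isam index"]

def key_usable(kind, key_domains, simple_ops):
    """Can the simple clauses drive a keyed scan on this access path?"""
    if kind.startswith("hash"):
        # hashing helps only with equality on every domain of a combined key
        return all("=" in simple_ops.get(d, set()) for d in key_domains)
    # ISAM permits ranges and needs only the most significant key domain
    return key_domains[0] in simple_ops

def pick_access_path(paths, simple_ops):
    """paths: dict of existing access paths, kind -> list of key domains.
    Returns the highest-priority usable path, else a full scan."""
    for kind in PRIORITY:
        if kind in paths and key_usable(kind, paths[kind], simple_ops):
            return kind
    return "full scan"
```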

To illustrate STRATEGY, we indicate what happens to queries (1)–(4) from Section 5.1.

Suppose EMPLOYEE is an ISAM relation with a key of NAME, while DEPT is hashed on FLOOR#. Moreover a secondary index for AGE exists which is hashed on AGE, and one for SALARY exists which uses ISAM with a key of SALARY.

Query (1): One simple clause exists (D.FLOOR# = 2). Hence Strategy (a) is applied against the hashed primary relation.

Query (2): One simple clause exists (E.AGE > 40). However, it is not usable to limit the scan on a hashed index. Hence a complete (unkeyed) scan of EMPLOYEE is required. Were the index for AGE an ISAM relation, then Strategy (b) would be used on this index.

Query (3): One simple clause exists and T1 has been reformatted to allow Strategy (a) against the hashed primary relation.

Query (4): Two simple clauses exist (value2 > M.SALARY; value3 = M.NAME). Strategy (a) is available on the hashed primary relation, as is Strategy (b) for the ISAM index. The algorithm chooses Strategy (a).

6  Utilities in Process 4

6.1  Implementation of Utility Commands

We have indicated in Section 1 several database utilities available to users. These commands are organized into several overlay programs as noted previously. Bringing the required overlay into core as needed is done in a straightforward way.

Most of the utilities update or read the system relations using AMI calls. MODIFY contains a sort routine which puts tuples in collating sequence according to the concatenation of the desired keys (which need not be of the same data type). Pages are initially loaded to approximately 80 percent of capacity. The sort routine is a recursive N-way merge-sort where N is the maximum number of files process 4 can have open at once (currently eight). The index building occurs in an obvious way. To convert to hash structures, MODIFY must specify the number of primary pages to be allocated. This parameter is used by the AMI in its hash scheme (which is a standard modulo division method).
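
The modulo-division placement MODIFY relies on can be sketched as follows; folding a non-integer key into an integer is an assumed encoding here, since the paper does not describe the AMI's actual one.

```python
def primary_page(key, n_primary_pages):
    """Assign a tuple to a primary page by the standard modulo-division
    method: page = key mod N, where N is chosen by MODIFY."""
    if not isinstance(key, int):
        # fold a non-integer key into an integer (encoding is an assumption)
        key = sum(str(key).encode())
    return key % n_primary_pages
```

This makes concrete why the number of primary pages matters: too few pages, as with CREATE's small default, forces long overflow chains.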

It should be noted that a user who creates an empty hash relation using the CREATE command and then copies a large UNIX file into it using COPY creates a very inefficient structure. This is because a relatively small default number of primary pages will have been specified by CREATE, and overflow chains will be long. A better strategy is to COPY into an unsorted table so that MODIFY can subsequently make a good guess at the number of primary pages to allocate.

6.2  Deferred Update and Recovery

Any updates (APPEND, DELETE, REPLACE) are processed by writing the tuples to be added, changed, or modified into a temporary file. When process 3 finishes, it calls process 4 to actually perform the modifications requested and any updates to secondary indices which may be required as a final step in processing. Deferred update is done for four reasons.

(a) Secondary index considerations. Suppose the following QUEL statement is executed:

 RANGE OF E IS EMPLOYEE

 REPLACE E(SALARY = 1.1*E.SALARY)

 WHERE E.SALARY > 20000

Suppose further that there is a secondary index on the salary domain and the primary relation is keyed on another domain.

OVQP, in finding the employees who qualify for the raise, will use the secondary index. If one employee qualifies and his tuple is modified and the secondary index updated, then the scan of the secondary index will find his tuple a second time since it has been moved forward. (In fact, his tuple will be found an arbitrary number of times.) Either secondary indices cannot be used to identify qualifying tuples when range qualifications are present (a rather unnatural restriction), or secondary indices must be updated in deferred mode.

(b) Primary relation considerations. Suppose the QUEL statement

[image: the REPLACE statement]

is executed for the following EMPLOYEE relation:

NAME      SALARY    MANAGER
Smith     10K       Jones
Jones     8K
Brown     9.5K      Smith

Logically Smith should get the pay cut and Brown should not. However, if Smith’s tuple is updated before Brown is checked for the pay cut, Brown will qualify. This undesirable situation must be avoided by deferred update.
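
The Smith/Brown anomaly can be reproduced in a few lines. The relation encoding, the size of the pay cut, and the scan order are all assumptions for illustration; Jones's manager is not shown in the table above and is taken here to be outside the relation.

```python
def qualifies(emp, name):
    """Does `name` currently earn more than his manager?"""
    salary, mgr = emp[name]
    return mgr in emp and salary > emp[mgr][0]

def immediate_update(emp):
    # update each tuple as soon as it qualifies: later checks see the new value
    for name in list(emp):
        if qualifies(emp, name):
            salary, mgr = emp[name]
            emp[name] = (salary * 0.9, mgr)
    return emp

def deferred_update(emp):
    # first identify every qualifying tuple against the unmodified relation,
    # then apply the spooled modifications
    spool = [n for n in emp if qualifies(emp, n)]
    for n in spool:
        salary, mgr = emp[n]
        emp[n] = (salary * 0.9, mgr)
    return emp
```

With Smith scanned before Brown, the immediate version cuts Smith's salary below Brown's and then wrongly cuts Brown as well; the deferred version cuts only Smith.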

(c) Functionality of updates. Suppose the following QUEL statement is executed:

 RANGE OF E, M IS EMPLOYEE

 REPLACE E(SALARY = M.SALARY)

This update attempts to assign to each employee the salary of every other employee, i.e. a single data item is to be replaced by multiple values. Stated differently, the REPLACE statement does not specify a function. In certain cases (such as a REPLACE involving only one tuple variable) functionality is guaranteed. However, in general the functionality of an update is data dependent. This nonfunctionality can only be checked if deferred update is performed.

To do so, the deferred update processor must check for duplicate TIDs in REPLACE calls (which requires sorting or hashing the update file). This potentially expensive operation does not exist in the current implementation, but will be optionally available in the future.
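
A sketch of that check, assuming the update file is spooled as (tuple-id, new-values) pairs; duplicates that agree on the new value are harmless, while conflicting duplicates reveal a nonfunctional REPLACE.

```python
def check_functional(update_file):
    """Verify that a spooled REPLACE assigns at most one value per tuple-id.
    Returns the deduplicated updates, or raises if the update is nonfunctional."""
    seen = {}
    for tid, values in update_file:
        if tid in seen and seen[tid] != values:
            raise ValueError(f"nonfunctional update: tuple {tid} assigned two values")
        seen[tid] = values
    return seen
```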

(d) Recovery considerations. The deferred update file provides a log of updates to be made. Recovery is provided upon system crash by the RESTORE command. In this case the deferred update routine is requested to destroy the temporary file if it has not yet started processing it. If it has begun processing, it reprocesses the entire update file in such a way that the effect is the same as if it were processed exactly once from start to finish.
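
A sketch of why reprocessing the whole file is safe: each spooled entry carries the absolute new tuple value rather than a delta, so applying the file twice leaves the database exactly as applying it once. The file format is an assumption for illustration.

```python
def replay(db, update_file):
    """Apply a deferred-update log; idempotent because entries overwrite."""
    for tid, new_tuple in update_file:
        db[tid] = new_tuple    # store the final value, not an increment
    return db
```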

Hence the update is “backed out” if deferred updating has not yet begun; otherwise it is processed to conclusion. The software is designed so the update file can be optionally spooled onto tape and recovered from tape. This added feature should soon be operational.

If a user from the terminal monitor (or a C program) wishes to stop a command he can issue a “break” character. In this case all processes reset except the deferred update program, which recovers in the same manner as above.

All update commands do deferred update; however the INGRES utilities have not yet been modified to do likewise. When this has been done, INGRES will recover from all crashes which leave the disk intact. In the meantime there can be disk-intact crashes which cannot be recovered in this manner (if they happen in such a way that the system catalogs are left inconsistent).

The INGRES “superuser” can checkpoint a database onto tape using the UNIX backup scheme. Since INGRES logs all interactions, a consistent system can always be obtained, albeit slowly, by restoring the last checkpoint and running the log of interactions (or the tape of deferred updates if it exists).

It should be noted that deferred update is a very expensive operation. One INGRES user has elected to have updates performed directly in process 3, cognizant that he must avoid executing interactions which will run incorrectly. Like checks for functionality, direct update may be optionally available in the future. Of course, a different recovery scheme must be implemented.

7  Conclusion and Future Extensions

The system described herein is in use at about fifteen installations. It forms the basis of an accounting system, a system for managing student records, a geodata system, a system for managing cable trouble reports and maintenance calls for a large telephone company, and assorted other smaller applications. These applications have been running for periods of up to nine months.

7.1  Performance

At this time no detailed performance measurements have been made, as the current version (labeled Version 5) has been operational for less than two months. We have instrumented the code and are in the process of collecting such measurements.

The sizes (in bytes) of the processes in INGRES are indicated below. Since the access methods are loaded with processes 2 and 3 and with many of the utilities, their contribution to the respective process sizes has been noted separately.

access methods (AM)            11K
terminal monitor               10K
EQUEL                          30K + AM
process 2                      45K + AM
process 3 (query processor)    45K + AM
utilities (8 overlays)         160K + AM

7.2 User Feedback

The feedback from internal and external users has been overwhelmingly positive.

In this section we indicate features that have been suggested for future systems.

(a)  Improved performance. Earlier versions of INGRES were very slow; the current version should alleviate this problem.

(b)  Recursion. QUEL does not support recursion, which must be tediously programmed in C using the precompiler; recursion capability has been suggested as a desired extension.

(c)  Other language extensions. These include user defined functions (especially counters), multiple target lists for a single qualification statement, and if-then-else control structures in QUEL; these features may presently be programmed, but only very inefficiently, using the precompiler.

(d)  Report generator. PRINT is a very primitive report generator and the need for augmented facilities in this area is clear; it should be written in EQUEL.

(e)  Bulk copy. The COPY routine fails to handle easily all situations that arise.

7.3  Future Extensions

Noted throughout the paper are areas where system improvement is in progress, planned, or desired by users. Other areas of extension include: (a) a multicomputer system version of INGRES to operate on distributed databases; (b) further performance enhancements; (c) a higher level user language including recursion and user defined functions; (d) better data definition and integrity features; and (e) a database administrator advisor.

The database administrator advisor program would run at idle priority and issue queries against a statistics relation to be kept by INGRES. It could then offer advice to a DBA concerning the choice of access methods and the selection of indices. This topic is discussed further in [16].

Acknowledgment

The following persons have played active roles in the design and implementation of INGRES: Eric Allman, Rick Berman, Jim Ford, Angela Go, Nancy McDonald, Peter Rubinstein, Iris Schoenberg, Nick Whyte, Carol Williams, Karel Youssefi, and Bill Zook.

References

  [1]  Allman, E., Stonebraker, M., and Held, G. Embedding a relational data sublanguage in a general purpose programming language. Proc. Conf. on Data, SIGPLAN Notices (ACM) 8, 2 (1976), 25–35.

[2]  Astrahan, M. M., et al. System R: Relational approach to database management. ACM Trans. on Database Systems 1, 2 (June 1976), 97–137.

  [3]  Boyce, R., et al. Specifying queries as relational expressions: SQUARE. Rep. RJ 1291, IBM Res. Lab., San Jose, Calif., Oct. 1973.

  [4]  Chamberlin, D., and Boyce, R. SEQUEL: A structured English query language. Proc. 1974 ACM-SIGMOD Workshop on Data Description, Access and Control, Ann Arbor, Mich., May 1974, pp. 249–264.

  [5]  Chamberlin, D., Gray, J.N., and Traiger, I.L. Views, authorization and locking in a relational data base system. Proc. AFIPS 1975 NCC, Vol. 44, AFIPS Press, Montvale, N.J., May 1975, pp. 425–430.

  [6]  Comm. on Data Systems Languages. CODASYL Data Base Task Group Rep., ACM, New York, 1971.

  [7]  Codd, E.F. A relational model of data for large shared data banks. Comm. ACM 13, 6 (June 1970), 377–387.

  [8]  Codd, E.F. A data base sublanguage founded on the relational calculus. Proc. 1971 ACM-SIGFIDET Workshop on Data Description, Access and Control, San Diego, Calif., Nov. 1971, pp. 35–68.

  [9]  Codd, E.F. Relational completeness of data base sublanguages. Courant Computer Science Symp. 6, May 1971, Prentice-Hall, Englewood Cliffs, N.J., pp. 65–90.

[10]  Codd, E.F., and Date, C.J. Interactive support for non-programmers, the relational and network approaches. Proc. 1974 ACM-SIGMOD Workshop on Data Description, Access and Control, Ann Arbor, Mich., May 1974.

[11]  Date, C.J., and Codd, E.F. The relational and network approaches: Comparison of the application programming interfaces. Proc. 1974 ACM-SIGMOD Workshop on Data Description, Access and Control, Vol. II, Ann Arbor, Mich., May 1974, pp. 85–113.

[12]  Gray, J.N., Lorie, R.A., and Putzolu, G.R. Granularity of locks in a shared data base. Proc. Int. Conf. of Very Large Data Bases, Framingham, Mass., Sept. 1975, pp. 428–451. (Available from ACM, New York.)

[13]  Go, A., Stonebraker, M., and Williams, C. An approach to implementing a geo-data system. Proc. ACM SIGGRAPH/SIGMOD Conf. for Data Bases in Interactive Design, Waterloo, Ont., Canada, Sept. 1975, pp. 67–77.

[14]  Gottlieb, D., et al. A classification of compression methods and their usefulness in a large data processing center. Proc. AFIPS 1975 NCC, Vol. 44, AFIPS Press, Montvale, N.J., May 1975, pp. 453–458.

[15]  Held, G.D., Stonebraker, M., and Wong, E. INGRES—A relational data base management system. Proc. AFIPS 1975 NCC, Vol. 44, AFIPS Press, Montvale, N.J., 1975, pp. 409–416.

[16]  Held, G.D. Storage Structures for Relational Data Base Management Systems. Ph.D. Th., Dep. of Electrical Eng. and Computer Science, U. of California, Berkeley, Calif., 1975.

[17]  Held, G., and Stonebraker, M. B-trees re-examined. Submitted to a technical journal.

[18]  IBM Corp. OS ISAM logic. GY28-6618, IBM Corp., White Plains, N.Y., 1966.

[19]  Johnson, S.C. YACC, yet another compiler-compiler. UNIX Programmer’s Manual, Bell Telephone Labs, Murray Hill, N.J., July 1974.

[20]  McDonald, N., and Stonebraker, M. Cupid—The friendly query language. Proc. ACM-Pacific-75, San Francisco, Calif., April 1975, pp. 127–131.

[21]  McDonald, N. CUPID: A graphics oriented facility for support of non-programmer interactions with a data base. Ph.D. Th., Dep. of Electrical Eng. and Computer Science, U. of California, Berkeley, Calif., 1975.

[22]  Ritchie, D.M., and Thompson, K. The UNIX Time-sharing system. Comm. ACM 17, 7 (July 1974), 365–375.

[23]  Schoenberg, I. Implementation of integrity constraints in the relational data base management system, INGRES. M.S. Th., Dep. of Electrical Eng. and Computer Science, U. of California, Berkeley, Calif., 1975.

[24]  Stonebraker, M. A functional view of data independence. Proc. 1974 ACM-SIGFIDET Workshop on Data Description, Access and Control, Ann Arbor, Mich., May 1974.

[25]  Stonebraker, M., and Wong, E. Access control in a relational data base management system by query modification. Proc. 1974 ACM Nat. Conf., San Diego, Calif., Nov. 1974, pp. 180–187.

[26]  Stonebraker, M. High level integrity assurance in relational data base systems. ERI Mem. No. M473, Electronics Res. Lab., U. of California, Berkeley, Calif., Aug. 1974.

[27]  Stonebraker, M. Implementation of integrity constraints and views by query modification. Proc. 1975 SIGMOD Workshop on Management of Data, San Jose, Calif., May 1975, pp. 65–78.

[28]  Stonebraker, M., and Rubinstein, P. The INGRES protection system. Proc. 1976 ACM National Conf., Houston, Tex., Oct. 1976 (to appear).

[29]  Tsichritzis, D. A network framework for relational implementation. Rep. CSRG-51, Computer Systems Res. Group, U. of Toronto, Toronto, Ont., Canada, Feb. 1975.

[30]  Wong, E., and Youssefi, K. Decomposition—A strategy for query processing. ACM Trans. on Database Systems 1, 3 (Sept. 1976), 223–241 (this issue).

[31]  Zook, W., et al. INGRES—Reference manual, 5. ERL Mem. No. M585, Electronics Res. Lab., U. of California, Berkeley, Calif., April 1976.

Received January 1976; revised April 1976

Copyright © 1976, Association for Computing Machinery, Inc. General permission to republish, but not for profit, all or part of this material is granted provided that ACM’s copyright notice is given and that reference is made to the publication, to its date of issue, and to the fact that reprinting privileges were granted by permission of the Association for Computing Machinery. This research was sponsored by Army Research Office Grant DAHCO4-74-G0087, the Naval Electronic Systems Command Contract N00039-76-C-0022, the Joint Services Electronics Program Contract F44620-71-C-0087, National Science Foundation Grants DCR75-03839 and ENG74-06651-A01, and a grant from the Sloan Foundation.

Authors’ addresses: M. Stonebraker and E. Wong, Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, Berkeley, CA 94720; P. Kreps, Department of Computer Science and Applied Mathematics, Building 50B, Lawrence Berkeley Laboratories, University of California, Berkeley, Berkeley, CA 94720; G. Held, Tandem Computers, Inc., Cupertino, CA 95014.

Originally published in ACM Transactions on Database Systems, 1(3): 189–222, 1976. Original DOI: 10.1145/320473.320476

The Collected Works of Michael Stonebraker

D. J. Abadi, D. Carney, U. Çetintemel, M. Cherniack, C. Convey, C. Erwin, E. F. Galvez, M. Hatoun, A. Maskey, A. Rasin, A. Singer, M. Stonebraker, N. Tatbul, Y. Xing, R. Yan, and S. B. Zdonik. 2003a. Aurora: A data stream management system. In Proc. ACM SIGMOD International Conference on Management of Data, p. 666. DOI: 10.1145/872757.872855. 225, 228, 229, 230, 232

D. J. Abadi, D. Carney, U. Çetintemel, M. Cherniack, C. Convey, S. Lee, M. Stonebraker, N. Tatbul, and S. B. Zdonik. 2003b. Aurora: a new model and architecture for data stream management. VLDB Journal, 12(2): 120–139. DOI: 10.1007/s00778-003-0095-z. 228, 229, 324

D. J. Abadi, R. Agrawal, A. Ailamaki, M. Balazinska, P. A. Bernstein, M. J. Carey, S. Chaudhuri, J. Dean, A. Doan, M. J. Franklin, J. Gehrke, L. M. Haas, A. Y. Halevy, J. M. Hellerstein, Y. E. Ioannidis, H. V. Jagadish, D. Kossmann, S. Madden, S. Mehrotra, T. Milo, J. F. Naughton, R. Ramakrishnan, V. Markl, C. Olston, B. C. Ooi, C. Ré, D. Suciu, M. Stonebraker, T. Walter, and J. Widom. 2014. The Beckman report on database research. ACM SIGMOD Record, 43(3): 61–70. DOI: 10.1145/2694428.2694441. 92

D. Abadi, R. Agrawal, A. Ailamaki, M. Balazinska, P. A. Bernstein, M. J. Carey, S. Chaudhuri, J. Dean, A. Doan, M. J. Franklin, J. Gehrke, L. M. Haas, A. Y. Halevy, J. M. Hellerstein, Y. E. Ioannidis, H. V. Jagadish, D. Kossmann, S. Madden, S. Mehrotra, T. Milo, J. F. Naughton, R. Ramakrishnan, V. Markl, C. Olston, B. C. Ooi, C. Ré, D. Suciu, M. Stonebraker, T. Walter, and J. Widom. 2016. The Beckman report on database research. Communications of the ACM, 59(2): 92–99. DOI: 10.1145/2845915. 92

Z. Abedjan, C. G. Akcora, M. Ouzzani, P. Papotti, and M. Stonebraker. 2015a. Temporal rules discovery for web data cleaning. Proc. VLDB Endowment, 9(4): 336–347. http://www.vldb.org/pvldb/vol9/p336-abedjan.pdf. 297

Z. Abedjan, J. Morcos, M. N. Gubanov, I. F. Ilyas, M. Stonebraker, P. Papotti, and M. Ouzzani. 2015b. Dataxformer: Leveraging the web for semantic transformations. In Proc. 7th Biennial Conference on Innovative Data Systems Research. http://www.cidrdb.org/cidr2015/Papers/CIDR15_Paper31.pdf. 296, 297

Z. Abedjan, X. Chu, D. Deng, R. C. Fernandez, I. F. Ilyas, M. Ouzzani, P. Papotti, M. Stonebraker, and N. Tang. 2016a. Detecting data errors: Where are we and what needs to be done? Proc. VLDB Endowment, 9(12): 993–1004. http://www.vldb.org/pvldb/vol9/p993-abedjan.pdf. 298

Z. Abedjan, J. Morcos, I. F. Ilyas, M. Ouzzani, P. Papotti, and M. Stonebraker. 2016b. Dataxformer: A robust transformation discovery system. In Proc. 32nd International Conference on Data Engineering, pp. 1134–1145. DOI: 10.1109/ICDE.2016.7498319. 296

S. Abiteboul, R. Agrawal, P. A. Bernstein, M. J. Carey, S. Ceri, W. B. Croft, D. J. DeWitt, M. J. Franklin, H. Garcia-Molina, D. Gawlick, J. Gray, L. M. Haas, A. Y. Halevy, J. M. Hellerstein, Y. E. Ioannidis, M. L. Kersten, M. J. Pazzani, M. Lesk, D. Maier, J. F. Naughton, H. Schek, T. K. Sellis, A. Silberschatz, M. Stonebraker, R. T. Snodgrass, J. D. Ullman, G. Weikum, J. Widom, and S. B. Zdonik. 2003. The Lowell database research self assessment. CoRR, cs.DB/0310006. http://arxiv.org/abs/cs.DB/0310006. 92

S. Abiteboul, R. Agrawal, P. A. Bernstein, M. J. Carey, S. Ceri, W. B. Croft, D. J. DeWitt, M. J. Franklin, H. Garcia-Molina, D. Gawlick, J. Gray, L. M. Haas, A. Y. Halevy, J. M. Hellerstein, Y. E. Ioannidis, M. L. Kersten, M. J. Pazzani, M. Lesk, D. Maier, J. F. Naughton, H. Schek, T. K. Sellis, A. Silberschatz, M. Stonebraker, R. T. Snodgrass, J. D. Ullman, G. Weikum, J. Widom, and S. B. Zdonik. 2005. The Lowell database research self-assessment. Communications of the ACM, 48(5): 111–118. DOI: 10.1145/1060710.1060718. 92

R. Agrawal, A. Ailamaki, P. A. Bernstein, E. A. Brewer, M. J. Carey, S. Chaudhuri, A. Doan, D. Florescu, M. J. Franklin, H. Garcia-Molina, J. Gehrke, L. Gruenwald, L. M. Haas, A. Y. Halevy, J. M. Hellerstein, Y. E. Ioannidis, H. F. Korth, D. Kossmann, S. Madden, R. Magoulas, B. C. Ooi, T. O’Reilly, R. Ramakrishnan, S. Sarawagi, M. Stonebraker, A. S. Szalay, and G. Weikum. 2008. The Claremont report on database research. ACM SIGMOD Record, 37(3): 9–19. DOI: 10.1145/1462571.1462573. 92

R. Agrawal, A. Ailamaki, P. A. Bernstein, E. A. Brewer, M. J. Carey, S. Chaudhuri, A. Doan, D. Florescu, M. J. Franklin, H. Garcia-Molina, J. Gehrke, L. Gruenwald, L. M. Haas, A. Y. Halevy, J. M. Hellerstein, Y. E. Ioannidis, H. F. Korth, D. Kossmann, S. Madden, R. Magoulas, B. C. Ooi, T. O’Reilly, R. Ramakrishnan, S. Sarawagi, M. Stonebraker, A. S. Szalay, and G. Weikum. 2009. The Claremont report on database research. Communications of the ACM, 52(6): 56–65. DOI: 10.1145/1516046.1516062. 92

A. Aiken, J. Chen, M. Lin, M. Spalding, M. Stonebraker, and A. Woodruff. 1995. The Tioga-2 database visualization environment. In Proc. Workshop on Database Issues for Data Visualization, pp. 181–207. DOI: 10.1007/3-540-62221-7_15.

A. Aiken, J. Chen, M. Stonebraker, and A. Woodruff. 1996. Tioga-2: A direct manipulation database visualization environment. In Proc. 12th International Conference on Data Engineering, pp. 208–217. DOI: 10.1109/ICDE.1996.492109.

E. Allman and M. Stonebraker. 1982. Observations on the evolution of a software system. IEEE Computer, 15(6): 27–32. DOI: 10.1109/MC.1982.1654047.

E. Allman, M. Stonebraker, and G. Held. 1976. Embedding a relational data sublanguage in a general purpose programming language. In Proc. SIGPLAN Conference on Data: Abstraction, Definition and Structure, pp. 25–35. DOI: 10.1145/800237.807115. 195

J. T. Anderson and M. Stonebraker. 1994. SEQUOIA 2000 metadata schema for satellite images. ACM SIGMOD Record, 23(4): 42–48. DOI: 10.1145/190627.190642.

A. Arasu, M. Cherniack, E. F. Galvez, D. Maier, A. Maskey, E. Ryvkina, M. Stonebraker, and R. Tibbetts. 2004. Linear road: A stream data management benchmark. In Proc. 30th International Conference on Very Large Data Bases, pp. 480–491. http://www.vldb.org/conf/2004/RS12P1.pdf. 326

T. Atwoode, J. Dash, J. Stein, M. Stonebraker, and M. E. S. Loomis. 1994. Objects and databases (panel). In Proc. 9th Annual Conference on Object-Oriented Programming Systems, Languages, and Applications, pp. 371–372. DOI: 10.1145/191080.191138.

H. Balakrishnan, M. Balazinska, D. Carney, U. Çetintemel, M. Cherniack, C. Convey, E. F. Galvez, J. Salz, M. Stonebraker, N. Tatbul, R. Tibbetts, and S. B. Zdonik. 2004. Retrospective on Aurora. VLDB Journal, 13(4): 370–383. DOI: 10.1007/s00778-004-0133-5. 228, 229

M. Balazinska, H. Balakrishnan, and M. Stonebraker. 2004b. Load management and high availability in the Medusa distributed stream processing system. In Proc. ACM SIGMOD International Conference on Management of Data, pp. 929–930. DOI: 10.1145/1007568.1007701. 325

M. Balazinska, H. Balakrishnan, and M. Stonebraker. 2004a. Contract-based load management in federated distributed systems. In Proc. 1st USENIX Symposium on Networked Systems Design and Implementation. http://www.usenix.org/events/nsdi04/tech/balazinska.html. 228, 230

M. Balazinska, H. Balakrishnan, S. Madden, and M. Stonebraker. 2005. Fault-tolerance in the Borealis distributed stream processing system. In Proc. ACM SIGMOD International Conference on Management of Data, pp. 13–24. DOI: 10.1145/1066157.1066160. 228, 230, 234, 325

M. Balazinska, H. Balakrishnan, S. Madden, and M. Stonebraker. 2008. Fault-tolerance in the borealis distributed stream processing system. ACM Transactions on Database Systems, 33(1): 3:1–3:44. DOI: 10.1145/1331904.1331907.

D. Barbará, J. A. Blakeley, D. H. Fishman, D. B. Lomet, and M. Stonebraker. 1994. The impact of database research on industrial products (panel summary). ACM SIGMOD Record, 23(3): 35–40. DOI: 10.1145/187436.187455.

V. Barr and M. Stonebraker. 2015a. A valuable lesson, and whither hadoop? Communications of the ACM, 58(1): 18–19. DOI: 10.1145/2686591. 50

V. Barr and M. Stonebraker. 2015b. How men can help women in cs; winning ’computing’s nobel prize’. Communications of the ACM, 58(11): 10–11. DOI: 10.1145/2820419.

V. Barr, M. Stonebraker, R. C. Fernandez, D. Deng, and M. L. Brodie. 2017. How we teach cs2all, and what to do about database decay. Communications of the ACM, 60(1): 10–11. http://dl.acm.org/citation.cfm?id=3014349.

L. Battle, M. Stonebraker, and R. Chang. 2013. Dynamic reduction of query result sets for interactive visualizaton. In Proc. 2013 IEEE International Conference on Big Data, pp. 1–8. DOI: 10.1109/BigData.2013.6691708.

L. Battle, R. Chang, and M. Stonebraker. 2016. Dynamic prefetching of data tiles for interactive visualization. In Proc. ACM SIGMOD International Conference on Management of Data, pp. 1363–1375. DOI: 10.1145/2882903.2882919.

R. Berman and M. Stonebraker. 1977. GEO-OUEL: a system for the manipulation and display of geographic data. In Proc. 4th Annual Conference Computer Graphics and Interactive Techniques, pp. 186–191. DOI: 10.1145/563858.563892.

P. A. Bernstein, U. Dayal, D. J. DeWitt, D. Gawlick, J. Gray, M. Jarke, B. G. Lindsay, P. C. Lockemann, D. Maier, E. J. Neuhold, A. Reuter, L. A. Rowe, H. Schek, J. W. Schmidt, M. Schrefl, and M. Stonebraker. 1989. Future directions in DBMS research—the Laguna Beach participants. ACM SIGMOD Record, 18(1): 17–26. 92

P. A. Bernstein, M. L. Brodie, S. Ceri, D. J. DeWitt, M. J. Franklin, H. Garcia-Molina, J. Gray, G. Held, J. M. Hellerstein, H. V. Jagadish, M. Lesk, D. Maier, J. F. Naughton, H. Pirahesh, M. Stonebraker, and J. D. Ullman. 1998a. The Asilomar report on database research. ACM SIGMOD Record, 27(4): 74–80. DOI: 10.1145/306101.306137. 92

P. A. Bernstein, M. L. Brodie, S. Ceri, D. J. DeWitt, M. J. Franklin, H. Garcia-Molina, J. Gray, G. Held, J. M. Hellerstein, H. V. Jagadish, M. Lesk, D. Maier, J. F. Naughton, H. Pirahesh, M. Stonebraker, and J. D. Ullman. 1998b. The Asilomar report on database research. CoRR, cs.DB/9811013. http://arxiv.org/abs/cs.DB/9811013. 92

A. Bhide and M. Stonebraker. 1987. Performance issues in high performance transaction processing architectures. In Proc. 2nd International Workshop High Performance Transaction Systems, pp. 277–300. DOI: 10.1007/3-540-51085-0_51. 91

A. Bhide and M. Stonebraker. 1988. A performance comparison of two architectures for fast transaction processing. In Proc. 4th International Conference on Data Engineering, pp. 536–545. DOI: 10.1109/ICDE.1988.105501. 91

M. L. Brodie and M. Stonebraker. 1993. Darwin: On the incremental migration of legacy information systems. Technical Report TR-0222-10-92-165, GTE Laboratories Incorporated.

M. L. Brodie and M. Stonebraker. 1995a. Migrating Legacy Systems: Gateways, Interfaces, and the Incremental Approach. Morgan Kaufmann. 91

M. L. Brodie and M. Stonebraker. 1995b. Legacy Information Systems Migration: Gateways, Interfaces, and the Incremental Approach. Morgan Kaufmann.

M. L. Brodie, M. Stonebraker, and J. Pei. 2018. The case for the co-evolution of applications and data. In New England Database Days.

P. Brown and M. Stonebraker. 1995. Bigsur: A system for the management of earth science data. In Proc. 21st International Conference on Very Large Data Bases, pp. 720–728. http://www.vldb.org/conf/1995/P720.pdf.

M. J. Carey and M. Stonebraker. 1984. The performance of concurrency control algorithms for database management systems. In Proc. 10th International Conference on Very Large Data Bases, pp. 107–118. http://www.vldb.org/conf/1984/P107.pdf. 91, 200

D. Carney, U. Çetintemel, M. Cherniack, C. Convey, S. Lee, G. Seidman, M. Stonebraker, N. Tatbul, and S. B. Zdonik. 2002. Monitoring streams—A new class of data management applications. In Proc. 28th International Conference on Very Large Data Bases, pp. 215–226. DOI: 10.1016/B978-155860869-6/50027-5. 228, 229, 324

D. Carney, U. Çetintemel, A. Rasin, S. B. Zdonik, M. Cherniack, and M. Stonebraker. 2003. Operator scheduling in a data stream manager. In Proc. 29th International Conference on Very Large Data Bases, pp. 838–849. http://www.vldb.org/conf/2003/papers/S25P02.pdf. 228, 229

U. Çetintemel, J. Du, T. Kraska, S. Madden, D. Maier, J. Meehan, A. Pavlo, M. Stonebraker, E. Sutherland, N. Tatbul, K. Tufte, H. Wang, and S. B. Zdonik. 2014. S-store: A streaming NewSQL system for big velocity applications. Proc. VLDB Endowment, 7(13): 1633–1636. http://www.vldb.org/pvldb/vol7/p1633-cetintemel.pdf. 234, 251

U. Çetintemel, D. J. Abadi, Y. Ahmad, H. Balakrishnan, M. Balazinska, M. Cherniack, J. Hwang, S. Madden, A. Maskey, A. Rasin, E. Ryvkina, M. Stonebraker, N. Tatbul, Y. Xing, and S. Zdonik. 2016. The Aurora and Borealis stream processing engines. In M. N. Garofalakis, J. Gehrke, and R. Rastogi, editors, Data Stream Management—Processing High-Speed Data Streams, pp. 337–359. Springer. ISBN 978-3-540-28607-3. DOI: 10.1007/978-3-540-28608-0_17.

R. Chandra, A. Segev, and M. Stonebraker. 1994. Implementing calendars and temporal rules in next generation databases. In Proc. 10th International Conference on Data Engineering, pp. 264–273. DOI: 10.1109/ICDE.1994.283040. 91

S. Chaudhuri, A. K. Chandra, U. Dayal, J. Gray, M. Stonebraker, G. Wiederhold, and M. Y. Vardi. 1996. Database research: Lead, follow, or get out of the way?—panel abstract. In Proc. 12th International Conference on Data Engineering, p. 190.

P. Chen, V. Gadepally, and M. Stonebraker. 2016. The BigDAWG monitoring framework. In Proc. 2016 IEEE High Performance Extreme Computing Conference, pp. 1–6. DOI: 10.1109/HPEC.2016.7761642. 373

Y. Chi, C. R. Mechoso, M. Stonebraker, K. Sklower, R. Troy, R. R. Muntz, and E. Mesrobian. 1997. ESMDIS: earth system model data information system. In Proc. 9th International Conference on Scientific and Statistical Database Management, pp. 116–118. DOI: 10.1109/SSDM.1997.621169.

P. Cudré-Mauroux, H. Kimura, K. Lim, J. Rogers, R. Simakov, E. Soroush, P. Velikhov, D. L. Wang, M. Balazinska, J. Becla, D. J. DeWitt, B. Heath, D. Maier, S. Madden, J. M. Patel, M. Stonebraker, and S. B. Zdonik. 2009. A demonstration of SciDB: A science-oriented DBMS. Proc. VLDB Endowment, 2(2): 1534–1537. DOI: 10.14778/1687553.1687584.

J. DeBrabant, A. Pavlo, S. Tu, M. Stonebraker, and S. B. Zdonik. 2013. Anti-caching: A new approach to database management system architecture. Proc. VLDB Endowment, 6(14): 1942–1953. http://www.vldb.org/pvldb/vol6/p1942-debrabant.pdf.

J. DeBrabant, J. Arulraj, A. Pavlo, M. Stonebraker, S. B. Zdonik, and S. Dulloor. 2014. A prolegomenon on OLTP database systems for non-volatile memory. In Proc. 5th International Workshop on Accelerating Data Management Systems Using Modern Processor and Storage Architectures, pp. 57–63. http://www.adms-conf.org/2014/adms14_debrabant.pdf.

D. Deng, R. C. Fernandez, Z. Abedjan, S. Wang, M. Stonebraker, A. K. Elmagarmid, I. F. Ilyas, S. Madden, M. Ouzzani, and N. Tang. 2017a. The data civilizer system. In Proc. 8th Biennial Conference on Innovative Data Systems Research. http://cidrdb.org/cidr2017/papers/p44-deng-cidr17.pdf. 293

D. Deng, A. Kim, S. Madden, and M. Stonebraker. 2017b. SILKMOTH: an efficient method for finding related sets with maximum matching constraints. CoRR, abs/1704.04738. http://arxiv.org/abs/1704.04738.

D. J. DeWitt, R. H. Katz, F. Olken, L. D. Shapiro, M. Stonebraker, and D. A. Wood. 1984. Implementation techniques for main memory database systems. In Proc. ACM SIGMOD International Conference on Management of Data, pp. 1–8. DOI: 10.1145/602259.602261. 111

D. J. DeWitt and M. Stonebraker. January 2008. MapReduce: A major step backwards. The Database Column. http://homes.cs.washington.edu/~billhowe/mapreduce_a_major_step_backwards.html. Accessed April 8, 2018. 50, 114, 136, 184, 209

D. J. DeWitt, I. F. Ilyas, J. F. Naughton, and M. Stonebraker. 2013. We are drowning in a sea of least publishable units (lpus). In Proc. ACM SIGMOD International Conference on Management of Data, pp. 921–922. DOI: 10.1145/2463676.2465345.

P. Dobbins, T. Dohzen, C. Grant, J. Hammer, M. Jones, D. Oliver, M. Pamuk, J. Shin, and M. Stonebraker. 2007. Morpheus 2.0: A data transformation management system. In Proc. 3rd International Workshop on Database Interoperability.

T. Dohzen, M. Pamuk, J. Hammer, and M. Stonebraker. 2006. Data integration through transform reuse in the Morpheus project. In Proc. ACM SIGMOD International Conference on Management of Data, pp. 736–738. DOI: 10.1145/1142473.1142571.

J. Dozier, M. Stonebraker, and J. Frew. 1994. Sequoia 2000: A next-generation information system for the study of global change. In Proc. 13th IEEE Symposium Mass Storage Systems, pp. 47–56. DOI: 10.1109/MASS.1994.373028.

J. Duggan and M. Stonebraker. 2014. Incremental elasticity for array databases. In Proc. ACM SIGMOD International Conference on Management of Data, pp. 409–420. DOI: 10.1145/2588555.2588569.

J. Duggan, A. J. Elmore, M. Stonebraker, M. Balazinska, B. Howe, J. Kepner, S. Madden, D. Maier, T. Mattson, and S. B. Zdonik. 2015a. The BigDAWG polystore system. ACM SIGMOD Record, 44(2): 11–16. DOI: 10.1145/2814710.2814713. 284

J. Duggan, O. Papaemmanouil, L. Battle, and M. Stonebraker. 2015b. Skew-aware join optimization for array databases. In Proc. ACM SIGMOD International Conference on Management of Data, pp. 123–135. DOI: 10.1145/2723372.2723709.

A. Dziedzic, J. Duggan, A. J. Elmore, V. Gadepally, and M. Stonebraker. 2015. BigDAWG: a polystore for diverse interactive applications. Data Systems for Interactive Analysis Workshop.

A. Dziedzic, A. J. Elmore, and M. Stonebraker. 2016. Data transformation and migration in polystores. In Proc. 2016 IEEE High Performance Extreme Computing Conference, pp. 1–6. DOI: 10.1109/HPEC.2016.7761594. 372

A. J. Elmore, J. Duggan, M. Stonebraker, M. Balazinska, U. Çetintemel, V. Gadepally, J. Heer, B. Howe, J. Kepner, T. Kraska, S. Madden, D. Maier, T. G. Mattson, S. Papadopoulos, J. Parkhurst, N. Tatbul, M. Vartak, and S. Zdonik. 2015. A demonstration of the Big-DAWG polystore system. Proc. VLDB Endowment, 8(12): 1908–1911. http://www.vldb.org/pvldb/vol8/p1908-Elmore.pdf. 287, 371

R. S. Epstein and M. Stonebraker. 1980. Analysis of distributed data base processing strategies. In Proc. 6th International Conference on Very Large Data Bases, pp. 92–101.

R. S. Epstein, M. Stonebraker, and E. Wong. 1978. Distributed query processing in a relational data base system. In Proc. ACM SIGMOD International Conference on Management of Data, pp. 169–180. DOI: 10.1145/509252.509292. 198

R. C. Fernandez, Z. Abedjan, S. Madden, and M. Stonebraker. 2016. Towards large-scale data discovery: position paper. In Proc. 3rd International Workshop on Exploratory Search in Databases and the Web, pp. 3–5. DOI: 10.1145/2948674.2948675.

R. C. Fernandez, D. Deng, E. Mansour, A. A. Qahtan, W. Tao, Z. Abedjan, A. K. Elmagarmid, I. F. Ilyas, S. Madden, M. Ouzzani, M. Stonebraker, and N. Tang. 2017b. A demo of the data civilizer system. In Proc. ACM SIGMOD International Conference on Management of Data, pp. 1639–1642. DOI: 10.1145/3035918.3058740.

R. C. Fernandez, D. Deng, E. Mansour, A. A. Qahtan, W. Tao, Z. Abedjan, A. Elmagarmid, I. F. Ilyas, S. Madden, M. Ouzzani, M. Stonebraker, and N. Tang. 2017a. A demo of the data civilizer system. In Proc. ACM SIGMOD International Conference on Management of Data, pp. 1636–1642. 293

R. C. Fernandez, Z. Abedjan, F. Koko, G. Yuan, S. Madden, and M. Stonebraker. 2018a. Aurum: A data discovery system. In Proc. 34th International Conference on Data Engineering, pp. 1001–1012.

R. C. Fernandez, E. Mansour, A. Qahtan, A. Elmagarmid, I. Ilyas, S. Madden, M. Ouzzani, M. Stonebraker, and N. Tang. 2018b. Seeping semantics: Linking datasets using word embeddings for data discovery. In Proc. 34th International Conference on Data Engineering, pp. 989–1000.

V. Gadepally, P. Chen, J. Duggan, A. J. Elmore, B. Haynes, J. Kepner, S. Madden, T. Mattson, and M. Stonebraker. 2016a. The BigDAWG polystore system and architecture. In Proc. 2016 IEEE High Performance Extreme Computing Conference, pp. 1–6. DOI: 10.1109/HPEC.2016.7761636. 287, 373

V. Gadepally, P. Chen, J. Duggan, A. J. Elmore, B. Haynes, J. Kepner, S. Madden, T. Mattson, and M. Stonebraker. 2016b. The BigDAWG polystore system and architecture. CoRR, abs/1609.07548. http://arxiv.org/abs/1609.07548.

V. Gadepally, P. Chen, J. Duggan, A. Elmore, B. Haynes, J. Kepnera, S. Madden, T. Mattson, and M. Stonebraker. 2016c. The BigDAWG polystore system and architecture. In Proc. 2016 IEEE High Performance Extreme Computing Conference, pp. 1–6. DOI: 10.1109/HPEC.2016.7761636.

V. Gadepally, J. Duggan, A. J. Elmore, J. Kepner, S. Madden, T. Mattson, and M. Stonebraker. 2016d. The BigDAWG architecture. CoRR, abs/1602.08791. http://arxiv.org/abs/1602.08791.

A. Go, M. Stonebraker, and C. Williams. 1975. An approach to implementing a geo-data system. In Proc. Workshop on Data Bases for Interactive Design, pp. 67–77.

J. Gray, H. Schek, M. Stonebraker, and J. D. Ullman. 2003. The Lowell report. In Proc. ACM SIGMOD International Conference on Management of Data, p. 680. DOI: 10.1145/872757.872873. 92

M. N. Gubanov and M. Stonebraker. 2013. Bootstrapping synonym resolution at web scale. In Proc. DIMACS/CCICADA Workshop on Big Data Integration.

M. N. Gubanov and M. Stonebraker. 2014. Large-scale semantic profile extraction. In Proc. 17th International Conference on Extending Database Technology, pp. 644–647. DOI: 10.5441/002/edbt.2014.64.

M. N. Gubanov, M. Stonebraker, and D. Bruckner. 2014. Text and structured data fusion in data tamer at scale. In Proc. 30th International Conference on Data Engineering, pp. 1258–1261. DOI: 10.1109/ICDE.2014.6816755.

A. M. Gupta, V. Gadepally, and M. Stonebraker. 2016. Cross-engine query execution in federated database systems. In Proc. 2016 IEEE High Performance Extreme Computing Conference, pp. 1–6. DOI: 10.1109/HPEC.2016.7761648. 373

A. Guttman and M. Stonebraker. 1982. Using a relational database management system for computer aided design data. Quarterly Bulletin IEEE Technical Committee on Data Engineering, 5(2): 21–28. http://sites.computer.org/debull/82JUN-CD.pdf. 201

J. Hammer, M. Stonebraker, and O. Topsakal. 2005. THALIA: test harness for the assessment of legacy information integration approaches. In Proc. 21st International Conference on Data Engineering, pp. 485–486. DOI: 10.1109/ICDE.2005.140.

R. Harding, D. V. Aken, A. Pavlo, and M. Stonebraker. 2017. An evaluation of distributed concurrency control. Proc. VLDB Endowment, 10(5): 553–564. DOI: 10.14778/3055540.3055548.

S. Harizopoulos, D. J. Abadi, S. Madden, and M. Stonebraker. 2008. OLTP through the looking glass, and what we found there. In Proc. ACM SIGMOD International Conference on Management of Data, pp. 981–992. DOI: 10.1145/1376616.1376713. 152, 246, 251, 346

P. B. Hawthorn and M. Stonebraker. 1979. Performance analysis of a relational data base management system. In Proc. ACM SIGMOD International Conference on Management of Data, pp. 1–12. DOI: 10.1145/582095.582097.

G. Held and M. Stonebraker. 1975. Storage structures and access methods in the relational data base management system INGRES. In Proc. ACM Pacific 75—Data: Its Use, Organization and Management, pp. 26–33. 194

G. Held and M. Stonebraker. 1978. B-trees re-examined. Communications of the ACM, 21(2): 139–143. DOI: 10.1145/359340.359348. 90, 197

G. Held, M. Stonebraker, and E. Wong. 1975. INGRES: A relational data base system. In National Computer Conference, pp. 409–416. DOI: 10.1145/1499949.1500029. 102, 397

J. M. Hellerstein and M. Stonebraker. 1993. Predicate migration: Optimizing queries with expensive predicates. In Proc. ACM SIGMOD International Conference on Management of Data, pp. 267–276. DOI: 10.1145/170035.170078.

J. M. Hellerstein and M. Stonebraker. 2005. Readings in Database Systems, 4th ed. MIT Press. ISBN 978-0-262-69314-1. http://mitpress.mit.edu/books/readings-database-systems.

J. M. Hellerstein, M. Stonebraker, and R. Caccia. 1999. Independent, open enterprise data integration. Quarterly Bulletin IEEE Technical Committee on Data Engineering, 22(1): 43–49. http://sites.computer.org/debull/99mar/cohera.ps.

J. M. Hellerstein, M. Stonebraker, and J. R. Hamilton. 2007. Architecture of a database system. Foundations and Trends in Databases, 1(2): 141–259. DOI: 10.1561/1900000002.

W. Hong and M. Stonebraker. 1991. Optimization of parallel query execution plans in XPRS. In Proc. 1st International Conference on Parallel and Distributed Information Systems, pp. 218–225. DOI: 10.1109/PDIS.1991.183106.

W. Hong and M. Stonebraker. 1993. Optimization of parallel query execution plans in XPRS. Distributed and Parallel Databases, 1(1): 9–32. DOI: 10.1007/BF01277518.

J. Hwang, M. Balazinska, A. Rasin, U. Çetintemel, M. Stonebraker, and S. B. Zdonik. 2005. High-availability algorithms for distributed stream processing. In Proc. 21st International Conference on Data Engineering, pp. 779–790. DOI: 10.1109/ICDE.2005.72. 228, 230, 325

A. Jhingran and M. Stonebraker. 1990. Alternatives in complex object representation: A performance perspective. In Proc. 6th International Conference on Data Engineering, pp. 94–102. DOI: 10.1109/ICDE.1990.113458.

A. Jindal, P. Rawlani, E. Wu, S. Madden, A. Deshpande, and M. Stonebraker. 2014. VERTEXICA: your relational friend for graph analytics! Proc. VLDB Endowment, 7(13): 1669–1672. http://www.vldb.org/pvldb/vol7/p1669-jindal.pdf.

R. Kallman, H. Kimura, J. Natkins, A. Pavlo, A. Rasin, S. B. Zdonik, E. P. C. Jones, S. Madden, M. Stonebraker, Y. Zhang, J. Hugg, and D. J. Abadi. 2008. H-store: a high-performance, distributed main memory transaction processing system. Proc. VLDB Endowment, 1(2): 1496–1499. DOI: 10.14778/1454159.1454211. 247, 249, 341

R. H. Katz, J. K. Ousterhout, D. A. Patterson, and M. Stonebraker. 1988. A project on high performance I/O subsystems. Quarterly Bulletin IEEE Technical Committee on Data Engineering, 11(1): 40–47. http://sites.computer.org/debull/88MAR-CD.pdf.

J. T. Kohl, C. Staelin, and M. Stonebraker. 1993a. Highlight: Using a log-structured file system for tertiary storage management. In Proc. of the Usenix Winter 1993 Technical Conference, pp. 435–448.

J. T. Kohl, M. Stonebraker, and C. Staelin. 1993b. Highlight: a file system for tertiary storage. In Proc. 12th IEEE Symposium Mass Storage Systems, pp. 157–161. DOI: 10.1109/MASS.1993.289765.

C. P. Kolovson and M. Stonebraker. 1989. Indexing techniques for historical databases. In Proc. 5th International Conference on Data Engineering, pp. 127–137. DOI: 10.1109/ICDE.1989.47208.

C. P. Kolovson and M. Stonebraker. 1991. Segment indexes: Dynamic indexing techniques for multi-dimensional interval data. In Proc. ACM SIGMOD International Conference on Management of Data, pp. 138–147. DOI: 10.1145/115790.115807.

R. A. Kowalski, D. B. Lenat, E. Soloway, M. Stonebraker, and A. Walker. 1988. Knowledge management—panel report. In Proc. 2nd International Conference on Expert Database Systems, pp. 63–69.

A. Kumar and M. Stonebraker. 1987a. The effect of join selectivities on optimal nesting order. ACM SIGMOD Record, 16(1): 28–41. DOI: 10.1145/24820.24822.

A. Kumar and M. Stonebraker. 1987b. Performance evaluation of an operating system transaction manager. In Proc. 13th International Conference on Very Large Data Bases, pp. 473–481. http://www.vldb.org/conf/1987/P473.pdf.

A. Kumar and M. Stonebraker. 1988. Semantics based transaction management techniques for replicated data. In Proc. ACM SIGMOD International Conference on Management of Data, pp. 117–125. DOI: 10.1145/50202.50215.

A. Kumar and M. Stonebraker. 1989. Performance considerations for an operating system transaction manager. IEEE Transactions on Software Engineering, 15(6): 705–714. DOI: 10.1109/32.24724.

R. Kung, E. N. Hanson, Y. E. Ioannidis, T. K. Sellis, L. D. Shapiro, and M. Stonebraker. 1984. Heuristic search in data base systems. In Proc. 1st International Workshop on Expert Database Systems, pp. 537–548.

C. A. Lynch and M. Stonebraker. 1988. Extended user-defined indexing with application to textual databases. In Proc. 14th International Conference on Very Large Data Bases, pp. 306–317. http://www.vldb.org/conf/1988/P306.pdf.

N. Malviya, A. Weisberg, S. Madden, and M. Stonebraker. 2014. Rethinking main memory OLTP recovery. In Proc. 30th International Conference on Data Engineering, pp. 604–615. DOI: 10.1109/ICDE.2014.6816685.

E. Mansour, D. Deng, A. Qahtan, R. C. Fernandez, W. Tao, Z. Abedjan, A. Elmagarmid, I. Ilyas, S. Madden, M. Ouzzani, M. Stonebraker, and N. Tang. 2018. Building data civilizer pipelines with an advanced workflow engine. In Proc. 34th International Conference on Data Engineering, pp. 1593–1596.

T. Mattson, D. A. Bader, J. W. Berry, A. Buluç, J. Dongarra, C. Faloutsos, J. Feo, J. R. Gilbert, J. Gonzalez, B. Hendrickson, J. Kepner, C. E. Leiserson, A. Lumsdaine, D. A. Padua, S. Poole, S. P. Reinhardt, M. Stonebraker, S. Wallach, and A. Yoo. 2013. Standards for graph algorithm primitives. In Proc. 2013 IEEE High Performance Extreme Computing Conference, pp. 1–2. DOI: 10.1109/HPEC.2013.6670338.

T. Mattson, D. A. Bader, J. W. Berry, A. Buluç, J. J. Dongarra, C. Faloutsos, J. Feo, J. R. Gilbert, J. Gonzalez, B. Hendrickson, J. Kepner, C. E. Leiserson, A. Lumsdaine, D. A. Padua, S. W. Poole, S. P. Reinhardt, M. Stonebraker, S. Wallach, and A. Yoo. 2014. Standards for graph algorithm primitives. CoRR, abs/1408.0393. DOI: 10.1109/HPEC.2013.6670338.

N. H. McDonald and M. Stonebraker. 1975. CUPID - the friendly query language. In Proc. ACM Pacific 75—Data: Its Use, Organization and Management, pp. 127–131.

J. Meehan, N. Tatbul, S. B. Zdonik, C. Aslantas, U. Çetintemel, J. Du, T. Kraska, S. Madden, D. Maier, A. Pavlo, M. Stonebraker, K. Tufte, and H. Wang. 2015a. S-store: Streaming meets transaction processing. CoRR, abs/1503.01143. DOI: 10.14778/2831360.2831367. 234

J. Meehan, N. Tatbul, S. Zdonik, C. Aslantas, U. Çetintemel, J. Du, T. Kraska, S. Madden, D. Maier, A. Pavlo, M. Stonebraker, K. Tufte, and H. Wang. 2015b. S-store: Streaming meets transaction processing. Proc. VLDB Endowment, 8(13): 2134–2145. DOI: 10.14778/2831360.2831367. 234, 288, 331, 374

J. Morcos, Z. Abedjan, I. F. Ilyas, M. Ouzzani, P. Papotti, and M. Stonebraker. 2015. Dataxformer: An interactive data transformation tool. In Proc. ACM SIGMOD International Conference on Management of Data, pp. 883–888. DOI: 10.1145/2723372.2735366. 296

B. Muthuswamy, L. Kerschberg, C. Zaniolo, M. Stonebraker, D. S. P. Jr., and M. Jarke. 1985. Architectures for expert-DBMS (panel). In Proc. 1985 ACM Annual Conference on the Range of Computing: Mid-80’s, pp. 424–426. DOI: 10.1145/320435.320555.

K. O’Brien, V. Gadepally, J. Duggan, A. Dziedzic, A. J. Elmore, J. Kepner, S. Madden, T. Mattson, Z. She, and M. Stonebraker. 2017. BigDAWG polystore release and demonstration. CoRR, abs/1701.05799. http://arxiv.org/abs/1701.05799.

V. E. Ogle and M. Stonebraker. 1995. Chabot: Retrieval from a relational database of images. IEEE Computer, 28(9): 40–48. DOI: 10.1109/2.410150.

M. A. Olson, W. Hong, M. Ubell, and M. Stonebraker. 1996. Query processing in a parallel object-relational database system. Quarterly Bulletin IEEE Technical Committee on Data Engineering, 19(4): 3–10. http://sites.computer.org/debull/96DEC-CD.pdf.

C. Olston, M. Stonebraker, A. Aiken, and J. M. Hellerstein. 1998a. VIQING: visual interactive querying. In Proc. 1998 IEEE Symposium on Visual Languages, pp. 162–169. DOI: 10.1109/VL.1998.706159.

C. Olston, A. Woodruff, A. Aiken, M. Chu, V. Ercegovac, M. Lin, M. Spalding, and M. Stonebraker. 1998b. Datasplash. In Proc. ACM SIGMOD International Conference on Management of Data, pp. 550–552. DOI: 10.1145/276304.276377.

J. Ong, D. Fogg, and M. Stonebraker. 1984. Implementation of data abstraction in the relational database system ingres. ACM SIGMOD Record, 14(1): 1–14. DOI: 10.1145/984540.984541. 201, 202, 206

R. Overmyer and M. Stonebraker. 1982. Implementation of a time expert in a data base system. ACM SIGMOD Record, 12(3): 51–60. DOI: 10.1145/984505.984509.

A. Pavlo, E. Paulson, A. Rasin, D. J. Abadi, D. J. DeWitt, S. Madden, and M. Stonebraker. 2009. A comparison of approaches to large-scale data analysis. In Proc. ACM SIGMOD International Conference on Management of Data, pp. 165–178. DOI: 10.1145/1559845.1559865.

H. Pirk, S. Madden, and M. Stonebraker. 2015. By their fruits shall ye know them: A data analyst’s perspective on massively parallel system design. In Proc. 11th Workshop on Data Management on New Hardware, pp. 5:1–5:6. DOI: 10.1145/2771937.2771944.

G. Planthaber, M. Stonebraker, and J. Frew. 2012. Earthdb: scalable analysis of MODIS data using SciDB. In Proc. 1st ACM SIGSPATIAL International Workshop on Analytics for Big Geospatial Data, pp. 11–19. DOI: 10.1145/2447481.2447483.

S. Potamianos and M. Stonebraker. 1996. The POSTGRES rules system. In Active Database Systems: Triggers and Rules For Advanced Database Processing, pp. 43–61. Morgan Kaufmann. 91, 168

C. Ré, D. Agrawal, M. Balazinska, M. Cafarella, M. Jordan, T. Kraska, and R. Ramakrishnan. 2015. Machine learning and databases: The sound of things to come or a cacophony of hype? In Proc. ACM SIGMOD International Conference on Management of Data, pp. 283–284.

D. R. Ries and M. Stonebraker. 1977a. Effects of locking granularity in a database management system. ACM Transactions on Database Systems, 2(3): 233–246. DOI: 10.1145/320557.320566. 91, 198

D. R. Ries and M. Stonebraker. 1977b. A study of the effects of locking granularity in a data base management system (abstract). In Proc. ACM SIGMOD International Conference on Management of Data, p. 121. DOI: 10.1145/509404.509422. 91

D. R. Ries and M. Stonebraker. 1979. Locking granularity revisited. ACM Transactions on Database Systems, 4(2): 210–227. http://doi.acm.org/10.1145/320071.320078. DOI: 10.1145/320071.320078. 91

L. A. Rowe and M. Stonebraker. 1981. Architecture of future data base systems. ACMSIGMOD Record, 11(1): 30–44. DOI: 10.1145/984471.984473.

L. A. Rowe and M. Stonebraker. 1986. The commercial INGRES epilogue. In M. Stonebraker, editor, The INGRES Papers: Anatomy of a Relational Database System, pp. 63–82. Addison-Wesley.

L. A. Rowe and M. Stonebraker. 1987. The POSTGRES data model. In Proc. 13th International Conference on Very Large Data Bases, pp. 83–96. http://www.vldb.org/conf/1987/P083.pdf. 258

L. A. Rowe and M. Stonebraker. 1990. The POSTGRES data model. In A. F. Cardenas and D. McLeod, editors, Research Foundations in Object-Oriented and Semantic Database Systems, pp. 91–110. Prentice Hall.

S. Sarawagi and M. Stonebraker. 1994. Efficient organization of large multidimensional arrays. In Proc. 10th International Conference on Data Engineering, pp. 328–336. DOI: 10.1109/ICDE.1994.283048.

S. Sarawagi and M. Stonebraker. 1996. Reordering query execution in tertiary memory databases. In Proc. 22nd International Conference on Very Large Data Bases. http://www.vldb.org/conf/1996/P156.pdf.

G. A. Schloss and M. Stonebraker. 1990. Highly redundant management of distributed data. In Proc. Workshop on the Management of Replicated Data, pp. 91–92.

A. Seering, P. Cudré-Mauroux, S. Madden, and M. Stonebraker. 2012. Efficient versioning for scientific array databases. In Proc. 28th International Conference on Data Engineering, pp. 1013–1024. DOI: 10.1109/ICDE.2012.102.

L. J. Seligman, N. J. Belkin, E. J. Neuhold, M. Stonebraker, and G. Wiederhold. 1995. Metrics for accessing heterogeneous data: Is there any hope? (panel). In Proc. 21st International Conference on Very Large Data Bases, p. 633. http://www.vldb.org/conf/1995/P633.pdf.

M. I. Seltzer and M. Stonebraker. 1990. Transaction support in read-optimized and write-optimized file systems. In Proc. 16th International Conference on Very Large Data Bases, pp. 174–185. http://www.vldb.org/conf/1990/P174.pdf.

M. I. Seltzer and M. Stonebraker. 1991. Read optimized file system designs: A performance evaluation. In Proc. 7th International Conference on Data Engineering, pp. 602–611. DOI: 10.1109/ICDE.1991.131509.

M. Serafini, R. Taft, A. J. Elmore, A. Pavlo, A. Aboulnaga, and M. Stonebraker. 2016. Clay: Fine-grained adaptive partitioning for general database schemas. Proc. VLDB Endowment, 10(4): 445–456. DOI: 10.14778/3025111.3025125.

J. Sidell, P. M. Aoki, A. Sah, C. Staelin, M. Stonebraker, and A. Yu. 1996. Data replication in Mariposa. In Proc. 12th International Conference on Data Engineering, pp. 485–494. DOI: 10.1109/ICDE.1996.492198.

A. Silberschatz, M. Stonebraker, and J. D. Ullman. 1990. Database systems: Achievements and opportunities—the “Lagunita” report of the NSF invitational workshop on the future of database system research held in Palo Alto, CA, February 22–23, 1990. ACM SIGMOD Record, 19(4): 6–22. DOI: 10.1145/122058.122059. 92

A. Silberschatz, M. Stonebraker, and J. D. Ullman. 1991. Database systems: Achievements and opportunities. Communications of the ACM, 34(10): 110–120. DOI: 10.1145/125223.125272.

A. Silberschatz, M. Stonebraker, and J. D. Ullman. 1996. Database research: Achievements and opportunities into the 21st century. ACM SIGMOD Record, 25(1): 52–63. DOI: 10.1145/381854.381886.

D. Skeen and M. Stonebraker. 1981. A formal model of crash recovery in a distributed system. In Proc. 5th Berkeley Workshop on Distributed Data Management and Computer Networks, pp. 129–142.

D. Skeen and M. Stonebraker. 1983. A formal model of crash recovery in a distributed system. IEEE Transactions on Software Engineering, 9(3): 219–228. DOI: 10.1109/TSE.1983.236608. 199

M. Stonebraker and R. Cattell. 2011. 10 rules for scalable performance in “simple operation” datastores. Communications of the ACM, 54(6): 72–80. DOI: 10.1145/1953122.1953144.

M. Stonebraker and U. Çetintemel. 2005. “One size fits all”: An idea whose time has come and gone (abstract). In Proc. 21st International Conference on Data Engineering, pp. 2–11. DOI: 10.1109/ICDE.2005.1. 50, 92, 103, 131, 152, 367, 401

M. Stonebraker and D. J. DeWitt. 2008. A tribute to Jim Gray. Communications of the ACM, 51(11): 54–57. DOI: 10.1145/1400214.1400230.

M. Stonebraker and A. Guttman. 1984. Using a relational database management system for computer aided design data—an update. Quarterly Bulletin IEEE Technical Committee on Data Engineering, 7(2): 56–60. http://sites.computer.org/debull/84JUN-CD.pdf.

M. Stonebraker and A. Guttman. 1984. R-trees: a dynamic index structure for spatial searching. In Proc. of the 1984 ACM SIGMOD International Conference on Management of Data (SIGMOD ’84), pp. 47–57. ACM, New York. DOI: 10.1145/602259.602266. 201

M. Stonebraker and M. A. Hearst. 1988. Future trends in expert data base systems. In Proc. 2nd International Conference on Expert Database Systems, pp. 3–20. 395

M. Stonebraker and G. Held. 1975. Networks, hierarchies and relations in data base management systems. In Proc. ACM Pacific 75—Data: Its Use, Organization and Management, pp. 1–9.

M. Stonebraker and J. M. Hellerstein, editors. 1998. Readings in Database Systems, 3rd ed. Morgan Kaufmann.

M. Stonebraker and J. M. Hellerstein. 2001. Content integration for e-business. In Proc. ACM SIGMOD International Conference on Management of Data, pp. 552–560. DOI: 10.1145/375663.375739.

M. Stonebraker and J. Hong. 2009. Saying good-bye to DBMSs, designing effective interfaces. Communications of the ACM, 52(9): 12–13. DOI: 10.1145/1562164.1562169.

M. Stonebraker and J. Hong. 2012. Researchers’ big data crisis; understanding design and functionality. Communications of the ACM, 55(2): 10–11. DOI: 10.1145/2076450.2076453.

M. Stonebraker and J. Kalash. 1982. TIMBER: A sophisticated relation browser (invited paper). In Proc. 8th International Conference on Very Large Data Bases, pp. 1–10. http://www.vldb.org/conf/1982/P001.pdf.

M. Stonebraker and K. Keller. 1980. Embedding expert knowledge and hypothetical data bases into a data base system. In Proc. ACM SIGMOD International Conference on Management of Data, pp. 58–66. DOI: 10.1145/582250.582261. 200

M. Stonebraker and G. Kemnitz. 1991. The Postgres next generation database management system. Communications of the ACM, 34(10): 78–92. DOI: 10.1145/125223.125262. 168, 206, 213

M. Stonebraker and A. Kumar. 1986. Operating system support for data management. Quarterly Bulletin IEEE Technical Committee on Data Engineering, 9(3): 43–50. http://sites.computer.org/debull/86SEP-CD.pdf. 47

M. Stonebraker and D. Moore. 1996. Object-Relational DBMSs: The Next Great Wave. Morgan Kaufmann. 111

M. Stonebraker and E. J. Neuhold. 1977. A distributed database version of INGRES. In Proc. 2nd Berkeley Workshop on Distributed Data Management and Computer Networks, pp. 19–36. 109, 198, 199

M. Stonebraker and M. A. Olson. 1993. Large object support in POSTGRES. In Proc. 9th International Conference on Data Engineering, pp. 355–362. DOI: 10.1109/ICDE.1993.344046.

M. Stonebraker and J. Robertson. 2013. Big data is “buzzword du jour;” CS academics “have the best job”. Communications of the ACM, 56(9): 10–11. DOI: 10.1145/2500468.2500471.

M. Stonebraker and L. A. Rowe. 1977. Observations on data manipulation languages and their embedding in general purpose programming languages. In Proc. 3rd International Conference on Very Large Data Bases, pp. 128–143.

M. Stonebraker and L. A. Rowe. 1984. Database portals: A new application program interface. In Proc. 10th International Conference on Very Large Data Bases, pp. 3–13. http://www.vldb.org/conf/1984/P003.pdf.

M. Stonebraker and L. A. Rowe. 1986. The design of Postgres. In Proc. ACM SIGMOD International Conference on Management of Data, pp. 340–355. DOI: 10.1145/16894.16888. 149, 203, 206

M. Stonebraker and P. Rubinstein. 1976. The INGRES protection system. In Proc. 1976 ACM Annual Conference, pp. 80–84. DOI: 10.1145/800191.805536. 398

M. Stonebraker and G. A. Schloss. 1990. Distributed RAID—A new multiple copy algorithm. In Proc. 6th International Conference on Data Engineering, pp. 430–437. DOI: 10.1109/ICDE.1990.113496.

M. Stonebraker and A. Weisberg. 2013. The VoltDB main memory DBMS. Quarterly Bulletin IEEE Technical Committee on Data Engineering, 36(2): 21–27. http://sites.computer.org/debull/A13june/VoltDBl.pdf.

M. Stonebraker and E. Wong. 1974b. Access control in a relational data base management system by query modification. In Proc. 1974 ACM Annual Conference, Volume 1, pp. 180–186. DOI: 10.1145/800182.810400. 45

M. Stonebraker, P. Rubinstein, R. Conway, D. Strip, H. R. Hartson, D. K. Hsiao, and E. B. Fernandez. 1976a. SIGBDP (paper session). In Proc. 1976 ACM Annual Conference, p. 79. DOI: 10.1145/800191.805535.

M. Stonebraker, E. Wong, P. Kreps, and G. Held. 1976b. The design and implementation of INGRES. ACM Transactions on Database Systems, 1(3): 189–222. DOI: 10.1145/320473.320476. 47, 148, 398

M. Stonebraker, R. R. Johnson, and S. Rosenberg. 1982a. A rules system for a relational data base management system. In Proc. 2nd International Conference on Databases: Improving Database Usability and Responsiveness, pp. 323–335. 91, 202

M. Stonebraker, J. Woodfill, J. Ranstrom, M. C. Murphy, J. Kalash, M. J. Carey, and K. Arnold. 1982b. Performance analysis of distributed data base systems. Quarterly Bulletin IEEE Technical Committee on Data Engineering, 5(4): 58–65. http://sites.computer.org/debull/82DEC-CD.pdf.

M. Stonebraker, W. B. Rubenstein, and A. Guttman. 1983a. Application of abstract data types and abstract indices to CAD data bases. In Engineering Design Applications, pp. 107–113.

M. Stonebraker, H. Stettner, N. Lynn, J. Kalash, and A. Guttman. 1983b. Document processing in a relational database system. ACM Transactions on Information Systems, 1(2): 143–158. DOI: 10.1145/357431.357433.

M. Stonebraker, J. Woodfill, and E. Andersen. 1983c. Implementation of rules in relational data base systems. Quarterly Bulletin IEEE Technical Committee on Data Engineering, 6(4): 65–74. http://sites.computer.org/debull/83DEC-CD.pdf. 91, 202

M. Stonebraker, J. Woodfill, J. Ranstrom, J. Kalash, K. Arnold, and E. Andersen. 1983d. Performance analysis of distributed data base systems. In Proc. 2nd Symposium on Reliable Distributed Systems, pp. 135–138.

M. Stonebraker, J. Woodfill, J. Ranstrom, M. C. Murphy, M. Meyer, and E. Allman. 1983e. Performance enhancements to a relational database system. ACM Transactions on Database Systems, 8(2): 167–185. DOI: 10.1145/319983.319984.

M. Stonebraker, E. Anderson, E. N. Hanson, and W. B. Rubenstein. 1984a. Quel as a data type. In Proc. ACM SIGMOD International Conference on Management of Data, pp. 208–214. DOI: 10.1145/602259.602287. 208

M. Stonebraker, J. Woodfill, J. Ranstrom, J. Kalash, K. Arnold, and E. Andersen. 1984b. Performance analysis of distributed data base systems. Performance Evaluation, 4(3): 220. DOI: 10.1016/0166-5316(84)90036-1.

M. Stonebraker, D. DuBourdieux, and W. Edwards. 1985. Problems in supporting data base transactions in an operating system transaction manager. Operating Systems Review, 19(1): 6–14. DOI: 10.1145/1041490.1041491.

M. Stonebraker, T. K. Sellis, and E. N. Hanson. 1986. An analysis of rule indexing implementations in data base systems. In Proc. 1st International Conference on Expert Database Systems, pp. 465–476. 91

M. Stonebraker, J. Anton, and E. N. Hanson. 1987a. Extending a database system with procedures. ACM Transactions on Database Systems, 12(3): 350–376. DOI: 10.1145/27629.27631.

M. Stonebraker, J. Anton, and M. Hirohama. 1987b. Extendability in POSTGRES. Quarterly Bulletin IEEE Technical Committee on Data Engineering, 10(2): 16–23. http://sites.computer.org/debull/87JUN-CD.pdf.

M. Stonebraker, E. N. Hanson, and C. Hong. 1987c. The design of the POSTGRES rules system. In Proc. 3rd International Conference on Data Engineering, pp. 365–374. DOI: 10.1109/ICDE.1987.7272402. 91

M. Stonebraker, E. N. Hanson, and S. Potamianos. 1988a. The POSTGRES rule manager. IEEE Transactions on Software Engineering, 14(7): 897–907. DOI: 10.1109/32.42733. 91, 168

M. Stonebraker, R. H. Katz, D. A. Patterson, and J. K. Ousterhout. 1988b. The design of XPRS. In Proc. 14th International Conference on Very Large Data Bases, pp. 318–330. http://www.vldb.org/conf/1988/P318.pdf.

M. Stonebraker, M. A. Hearst, and S. Potamianos. 1989. A commentary on the POSTGRES rule system. ACM SIGMOD Record, 18(3): 5–11. DOI: 10.1145/71031.71032. 91, 168, 395

M. Stonebraker, A. Jhingran, J. Goh, and S. Potamianos. 1990a. On rules, procedures, caching and views in data base systems. In Proc. ACM SIGMOD International Conference on Management of Data, pp. 281–290. DOI: 10.1145/93597.98737.

M. Stonebraker, L. A. Rowe, and M. Hirohama. 1990b. The implementation of POSTGRES. IEEE Transactions on Knowledge and Data Engineering, 2(1): 125–142. DOI: 10.1109/69.50912. 47, 168

M. Stonebraker, L. A. Rowe, B. G. Lindsay, J. Gray, M. J. Carey, and D. Beech. 1990e. Third generation data base system manifesto—The committee for advanced DBMS function. In Proc. ACM SIGMOD International Conference on Management of Data, p. 396.

M. Stonebraker, L. A. Rowe, B. G. Lindsay, J. Gray, M. J. Carey, M. L. Brodie, P. A. Bernstein, and D. Beech. 1990c. Third-generation database system manifesto—The committee for advanced DBMS function. ACM SIGMOD Record, 19(3): 31–44. DOI: 10.1145/101077.390001. 91

M. Stonebraker, L. A. Rowe, B. G. Lindsay, J. Gray, M. J. Carey, M. L. Brodie, P. A. Bernstein, and D. Beech. 1990d. Third-generation database system manifesto—The committee for advanced DBMS function. In Proc. IFIP TC2/WG 2.6 Working Conference on Object-Oriented Databases: Analysis, Design & Construction, pp. 495–511. 91

M. Stonebraker, R. Agrawal, U. Dayal, E. J. Neuhold, and A. Reuter. 1993a. DBMS research at a crossroads: The Vienna update. In Proc. 19th International Conference on Very Large Data Bases, pp. 688–692. http://www.vldb.org/conf/1993/P688.pdf.

M. Stonebraker, J. Chen, N. Nathan, C. Paxson, A. Su, and J. Wu. 1993b. Tioga: A database-oriented visualization tool. In Proc. IEEE Conference on Visualization, pp. 86–93. DOI: 10.1109/VISUAL.1993.398855. 393

M. Stonebraker, J. Chen, N. Nathan, C. Paxson, and J. Wu. 1993c. Tioga: Providing data management support for scientific visualization applications. In Proc. 19th International Conference on Very Large Data Bases, pp. 25–38. http://www.vldb.org/conf/1993/P025.pdf. 393

M. Stonebraker, J. Frew, and J. Dozier. 1993d. The SEQUOIA 2000 project. In Proc. 3rd International Symposium Advances in Spatial Databases, pp. 397–412. DOI: 10.1007/3-540-56869-7_22.

M. Stonebraker, J. Frew, K. Gardels, and J. Meredith. 1993e. The Sequoia 2000 benchmark. In Proc. ACM SIGMOD International Conference on Management of Data, pp. 2–11. DOI: 10.1145/170035.170038.

M. Stonebraker, P. M. Aoki, R. Devine, W. Litwin, and M. A. Olson. 1994a. Mariposa: A new architecture for distributed data. In Proc. 10th International Conference on Data Engineering, pp. 54–65. DOI: 10.1109/ICDE.1994.283004. 401

M. Stonebraker, R. Devine, M. Kornacker, W. Litwin, A. Pfeffer, A. Sah, and C. Staelin. 1994b. An economic paradigm for query processing and data migration in Mariposa. In Proc. 3rd International Conference on Parallel and Distributed Information Systems, pp. 58–67. DOI: 10.1109/PDIS.1994.331732.

M. Stonebraker, P. M. Aoki, W. Litwin, A. Pfeffer, A. Sah, J. Sidell, C. Staelin, and A. Yu. 1996. Mariposa: A wide-area distributed database system. VLDB Journal, 5(1): 48–63. DOI: 10.1007/s007780050015.

M. Stonebraker, P. Brown, and M. Herbach. 1998a. Interoperability, distributed applications and distributed databases: The virtual table interface. Quarterly Bulletin IEEE Technical Committee on Data Engineering, 21(3): 25–33. http://sites.computer.org/debull/98sept/informix.ps.

M. Stonebraker, P. Brown, and D. Moore. 1998b. Object-Relational DBMSs, 2nd ed. Morgan Kaufmann.

M. Stonebraker, D. J. Abadi, A. Batkin, X. Chen, M. Cherniack, M. Ferreira, E. Lau, A. Lin, S. Madden, E. J. O’Neil, P. E. O’Neil, A. Rasin, N. Tran, and S. B. Zdonik. 2005a. C-store: A column-oriented DBMS. In Proc. 31st International Conference on Very Large Data Bases, pp. 553–564. http://www.vldb2005.org/program/paper/thu/p553-stonebraker.pdf. 104, 132, 151, 238, 242, 258, 333, 335, 402

M. Stonebraker, U. Çetintemel, and S. B. Zdonik. 2005b. The 8 requirements of realtime stream processing. ACM SIGMOD Record, 34(4): 42–47. DOI: 10.1145/1107499.1107504. 282

M. Stonebraker, C. Bear, U. Çetintemel, M. Cherniack, T. Ge, N. Hachem, S. Harizopoulos, J. Lifter, J. Rogers, and S. B. Zdonik. 2007a. One size fits all? Part 2: Benchmarking studies. In Proc. 3rd Biennial Conference on Innovative Data Systems Research, pp. 173–184. http://www.cidrdb.org/cidr2007/papers/cidr07p20.pdf. 103, 282

M. Stonebraker, S. Madden, D. J. Abadi, S. Harizopoulos, N. Hachem, and P. Helland. 2007b. The end of an architectural era (it’s time for a complete rewrite). In Proc. 33rd International Conference on Very Large Data Bases, pp. 1150–1160. http://www.vldb.org/conf/2007/papers/industrial/p1150-stonebraker.pdf. 247, 341, 344

M. Stonebraker, J. Becla, D. J. DeWitt, K. Lim, D. Maier, O. Ratzesberger, and S. B. Zdonik. 2009. Requirements for science data bases and SciDB. In Proc. 4th Biennial Conference on Innovative Data Systems Research. http://www-db.cs.wisc.edu/cidr/cidr2009/Paper_26.pdf. 257

M. Stonebraker, D. J. Abadi, D. J. DeWitt, S. Madden, E. Paulson, A. Pavlo, and A. Rasin. 2010. MapReduce and parallel DBMSs: friends or foes? Communications of the ACM, 53(1): 64–71. DOI: 10.1145/1629175.1629197. 50, 136, 251

M. Stonebraker, P. Brown, A. Poliakov, and S. Raman. 2011. The architecture of SciDB. In Proc. 23rd International Conference on Scientific and Statistical Database Management, pp. 1–16. DOI: 10.1007/978-3-642-22351-8_1.

M. Stonebraker, A. Ailamaki, J. Kepner, and A. S. Szalay. 2012. The future of scientific data bases. In Proc. 28th International Conference on Data Engineering, pp. 7–8. DOI: 10.1109/ICDE.2012.151.

M. Stonebraker, P. Brown, D. Zhang, and J. Becla. 2013a. SciDB: A database management system for applications with complex analytics. Computing in Science and Engineering, 15(3): 54–62. DOI: 10.1109/MCSE.2013.19.

M. Stonebraker, D. Bruckner, I. F. Ilyas, G. Beskales, M. Cherniack, S. B. Zdonik, A. Pagan, and S. Xu. 2013b. Data curation at scale: The data tamer system. In Proc. 6th Biennial Conference on Innovative Data Systems Research. http://www.cidrdb.org/cidr2013/Papers/CIDR13_Paper28.pdf. 105, 150, 269, 297, 357, 358

M. Stonebraker, J. Duggan, L. Battle, and O. Papaemmanouil. 2013c. SciDB DBMS research at M.I.T. Quarterly Bulletin IEEE Technical Committee on Data Engineering, 36(4): 21–30. http://sites.computer.org/debull/A13dec/p21.pdf.

M. Stonebraker, S. Madden, and P. Dubey. 2013d. Intel “big data” science and technology center vision and execution plan. ACM SIGMOD Record, 42(1): 44–49. DOI: 10.1145/2481528.2481537.

M. Stonebraker, A. Pavlo, R. Taft, and M. L. Brodie. 2014. Enterprise database applications and the cloud: A difficult road ahead. In Proc. 7th IEEE International Conference on Cloud Computing, pp. 1–6. DOI: 10.1109/IC2E.2014.97.

M. Stonebraker, D. Deng, and M. L. Brodie. 2016. Database decay and how to avoid it. In Proc. 2016 IEEE International Conference on Big Data, pp. 7–16. DOI: 10.1109/BigData.2016.7840584.

M. Stonebraker, D. Deng, and M. L. Brodie. 2017. Application-database co-evolution: A new design and development paradigm. In North East Database Day, pp. 1–3.

M. Stonebraker. 1972a. Retrieval efficiency using combined indexes. In Proc. 1972 ACM-SIGFIDET Workshop on Data Description, Access and Control, pp. 243–256.

M. Stonebraker. 1972b. A simplification of forrester’s model of an urban area. IEEE Transactions on Systems, Man, and Cybernetics, 2(4): 468–472. DOI: 10.1109/TSMC.1972.4309156.

M. Stonebraker. 1974a. The choice of partial inversions and combined indices. International Journal Parallel Programming, 3(2): 167–188. DOI: 10.1007/BF00976642.

M. Stonebraker. 1974b. A functional view of data independence. In Proc. 1974 ACM SIGMOD Workshop on Data Description, Access and Control, pp. 63–81. DOI: 10.1145/800296.811505. 404, 405

M. Stonebraker. 1975. Getting started in INGRES—A tutorial, Memorandum No. ERL-M518, Electronics Research Laboratory, College of Engineering, UC Berkeley. 196

M. Stonebraker. 1975. Implementation of integrity constraints and views by query modification. In Proc. ACM SIGMOD International Conference on Management of Data, pp. 65–78. DOI: 10.1145/500080.500091. 45, 90

M. Stonebraker. 1976a. Proposal for a network INGRES. In Proc. 1st Berkeley Workshop on Distributed Data Management and Computer Networks, p. 132.

M. Stonebraker. 1976b. The data base management system INGRES. In Proc. 1st Berkeley Workshop on Distributed Data Management and Computer Networks, p. 336. 195

M. Stonebraker. 1976c. A comparison of the use of links and secondary indices in a relational data base system. In Proc. 2nd International Conference on Software Engineering, pp. 527–531. http://dl.acm.org/citation.cfm?id=807727.

M. Stonebraker. 1978. Concurrency control and consistency of multiple copies of data in distributed INGRES. In Proc. 3rd Berkeley Workshop on Distributed Data Management and Computer Networks, pp. 235–258. 90, 398

M. Stonebraker. May 1979a. Muffin: A distributed database machine. Technical Report ERL Technical Report UCB/ERL M79/28, University of California at Berkeley. 151

M. Stonebraker. 1979b. Concurrency control and consistency of multiple copies of data in distributed INGRES. IEEE Transactions on Software Engineering, 5(3): 188–194. DOI: 10.1109/TSE.1979.234180. 398

M. Stonebraker. 1980. Retrospection on a database system. ACM Transactions on Database Systems, 5(2): 225–240. DOI: 10.1145/320141.320158.

M. Stonebraker. 1981a. Operating system support for database management. Communications of the ACM, 24(7): 412–418. DOI: 10.1145/358699.358703.

M. Stonebraker. 1981b. Chairman’s column. ACM SIGMOD Record, 11(3): i–iv.

M. Stonebraker. 1981c. Chairman’s column. ACM SIGMOD Record, 11(4): 2–4.

M. Stonebraker. 1981d. Chairman’s column. ACM SIGMOD Record, 12(1): 1–3.

M. Stonebraker. 1981e. In memory of Kevin Whitney. ACM SIGMOD Record, 12(1): 7.

M. Stonebraker. 1981f. Chairman’s column. ACM SIGMOD Record, 11(1): 1–4.

M. Stonebraker. 1981g. Hypothetical data bases as views. In Proc. ACM SIGMOD International Conference on Management of Data, pp. 224–229. DOI: 10.1145/582318.582352.

M. Stonebraker. 1982a. Chairman’s column. ACM SIGMOD Record, 12(3): 2–4.

M. Stonebraker. 1982b. Letter to Peter Denning (two VLDB conferences). ACM SIGMOD Record, 12(3): 6–7.

M. Stonebraker. 1982c. Chairman’s column. ACM SIGMOD Record, 12(4): a–c.

M. Stonebraker. 1982d. Chairman’s column. ACM SIGMOD Record, 13(1): 2–3&4.

M. Stonebraker. 1982e. Adding semantic knowledge to a relational database system. In M. L. Brodie, M. John, and S. J. W., editors, On Conceptual Modelling, pp. 333–352. Springer. DOI: 10.1007/978-1-4612-5196-5_12.

M. Stonebraker. 1982f. A database perspective. In M. L. Brodie, M. John, and S. J. W., editors, On Conceptual Modelling, pp. 457–458. Springer. DOI: 10.1007/978-1-4612-5196-5_18.

M. Stonebraker. 1983a. DBMS and AI: is there any common point of view? In Proc. ACM SIGMOD International Conference on Management of Data, p. 134. DOI: 10.1145/582192.582215. 201, 205

M. Stonebraker. April 1983b. Chairman’s column. ACM SIGMOD Record, 13(3): 1–3.

M. Stonebraker. January 1983c. Chairman’s column. ACM SIGMOD Record, 13(2): 1–3.

M. Stonebraker. 1984. Virtual memory transaction management. Operating Systems Review, 18(2): 8–16. DOI: 10.1145/850755.850757. 203

M. Stonebraker. 1985a. Triggers and inference in data base systems. In Proc. 1985 ACM Annual Conference on the Range of Computing: Mid-80’s Perspective, p. 426. DOI: 10.1145/320435.323372.

M. Stonebraker. 1985b. Triggers and inference in database systems. In M. L. Brodie and J. Mylopoulos, editors, On Knowledge Base Management Systems, pp. 297–314. Springer. 202

M. Stonebraker. 1985c. Expert database systems/bases de données et systèmes experts. In Journées Bases de Données Avancés.

M. Stonebraker. 1985d. The case for shared nothing. In Proc. International Workshop on High-Performance Transaction Systems, p. 0. 91

M. Stonebraker. 1985e. Tips on benchmarking data base systems. Quarterly Bulletin IEEE Technical Committee on Data Engineering, 8(1): 10–18. http://sites.computer.org/debull/85MAR-CD.pdf.

M. Stonebraker, editor. 1986a. The INGRES Papers: Anatomy of a Relational Database System. Addison-Wesley.

M. Stonebraker. 1986b. Inclusion of new types in relational data base systems. In Proc. 2nd International Conference on Data Engineering, pp. 262–269. DOI: 10.1109/ICDE.1986.7266230. 88, 202, 258

M. Stonebraker. 1986c. Object management in Postgres using procedures. In Proc. International Workshop on Object-Oriented Database Systems, pp. 66–72. http://dl.acm.org/citation.cfm?id=318840. 45, 88, 399

M. Stonebraker. 1986d. The case for shared nothing. Quarterly Bulletin IEEE Technical Committee on Data Engineering, 9(1): 4–9. http://sites.computer.org/debull/86MAR-CD.pdf. 91, 216

M. Stonebraker. 1986e. Design of relational systems (introduction to section 1). In M. Stonebraker, editor, The INGRES Papers: Anatomy of a Relational Database System, pp. 1–3. Addison-Wesley.

M. Stonebraker. 1986f. Supporting studies on relational systems (introduction to section 2). In M. Stonebraker, editor, The INGRES Papers, pp. 83–85. Addison-Wesley.

M. Stonebraker. 1986g. Distributed database systems (introduction to section 3). In M. Stonebraker, editor, The INGRES Papers: Anatomy of a Relational Database System, pp. 183–186. Addison-Wesley.

M. Stonebraker. 1986h. The design and implementation of distributed INGRES. In M. Stonebraker, editor, The INGRES Papers: Anatomy of a Relational Database System, pp. 187–196. Addison-Wesley.

M. Stonebraker. 1986i. User interfaces for database systems (introduction to section 4). In M. Stonebraker, editor, The INGRES Papers: Anatomy of a Relational Database System, pp. 243–245. Addison-Wesley.

M. Stonebraker. 1986j. Extended semantics for the relational model (introduction to section 5). In M. Stonebraker, editor, The INGRES Papers: Anatomy of a Relational Database System, pp. 313–316. Addison-Wesley.

M. Stonebraker. 1986k. Database design (introduction to section 6). In M. Stonebraker, editor, The INGRES Papers: Anatomy of a Relational Database System, pp. 393–394. Addison-Wesley.

M. Stonebraker. 1986l. Object management in a relational data base system. In Digest of Papers - COMPCON, pp. 336–341.

M. Stonebraker. 1987. The design of the POSTGRES storage system. In Proc. 13th International Conference on Very Large Data Bases, pp. 289–300. http://www.vldb.org/conf/1987/P289.pdf. 168, 214, 258

M. Stonebraker, editor. 1988a. Readings in Database Systems. Morgan Kaufmann.

M. Stonebraker. 1988b. Future trends in data base systems. In Proc. 4th International Conference on Data Engineering, pp. 222–231. DOI: 10.1109/ICDE.1988.105464.

M. Stonebraker. 1989a. The case for partial indexes. ACM SIGMOD Record, 18(4): 4–11. DOI: 10.1145/74120.74121.

M. Stonebraker. 1989b. Future trends in database systems. IEEE Transactions on Knowledge and Data Engineering, 1(1): 33–44. DOI: 10.1109/69.43402.

M. Stonebraker. 1990a. The third-generation database manifesto: A brief retrospection. In Proc. IFIP TC2/WG 2.6 Working Conference on Object-Oriented Databases: Analysis, Design & Construction, pp. 71–72.

M. Stonebraker. 1990b. Architecture of future data base systems. Quarterly Bulletin IEEE Technical Committee on Data Engineering, 13(4): 18–23. http://sites.computer.org/debull/90DEC-CD.pdf.

M. Stonebraker. 1990c. Data base research at Berkeley. ACM SIGMOD Record, 19(4): 113–118. DOI: 10.1145/122058.122072.

M. Stonebraker. 1990d. Introduction to the special issue on database prototype systems. IEEE Transactions on Knowledge and Data Engineering, 2(1): 1–3. DOI: 10.1109/TKDE.1990.10000.

M. Stonebraker. 1990e. The Postgres DBMS. In Proc. ACM SIGMOD International Conference on Management of Data, p. 394.

M. Stonebraker. 1991a. Managing persistent objects in a multi-level store. In Proc. ACM SIGMOD International Conference on Management of Data, pp. 2–11. DOI: 10.1145/115790.115791.

M. Stonebraker. 1991b. Object management in Postgres using procedures. In K. R. Dittrich, U. Dayal, and A. P. Buchmann, editors, On Object-Oriented Database Systems, pp. 53–64. Springer. DOI: 10.1007/978-3-642-84374-7_5.

M. Stonebraker. 1971. The reduction of large scale Markov models for random chains. Ph.D. Dissertation. University of Michigan, Ann Arbor, MI. AAI7123885. 43

M. Stonebraker. 1992a. The integration of rule systems and database systems. IEEE Transactions on Knowledge and Data Engineering, 4(5): 415–423. DOI: 10.1109/69.166984. 91

M. Stonebraker, editor. 1992b. Proceedings of the 1992 ACM SIGMOD International Conference on Management of Data. ACM Press.

M. Stonebraker. 1993a. The SEQUOIA 2000 project. Quarterly Bulletin IEEE Technical Committee on Data Engineering, 16(1): 24–28. http://sites.computer.org/debull/93MAR-CD.pdf.

M. Stonebraker. 1993b. Are we polishing a round ball? (panel abstract). In Proc. 9th International Conference on Data Engineering, p. 606.

M. Stonebraker. 1993c. The miro DBMS. In Proc. ACM SIGMOD International Conference on Management of Data, p. 439. DOI: 10.1145/170035.170124. 314

M. Stonebraker. 1994a. SEQUOIA 2000: A reflection of the first three years. In Proc. 7th International Working Conference on Scientific and Statistical Database Management, pp. 108–116. DOI: 10.1109/SSDM.1994.336956.

M. Stonebraker, editor. 1994b. Readings in Database Systems, 2. Morgan Kaufmann.

M. Stonebraker. 1994c. Legacy systems—the Achilles heel of downsizing (panel). In Proc. 3rd International Conference on Parallel and Distributed Information Systems, p. 108.

M. Stonebraker. 1994d. In memory of Bob Kooi (1951-1993). ACM SIGMOD Record, 23(1): 3. DOI: 10.1145/181550.181551.

M. Stonebraker. 1995. An overview of the Sequoia 2000 project. Digital Technical Journal, 7(3). http://www.hpl.hp.com/hpjournal/dtj/vol7num3/vol7num3art3.pdf. 215, 255

M. Stonebraker. 1998. Are we working on the right problems? (panel). In Proc. ACM SIGMOD International Conference on Management of Data, p. 496. DOI: 10.1145/276304.276348.

M. Stonebraker. 2002. Too much middleware. ACM SIGMOD Record, 31(1): 97–106. DOI: 10.1145/507338.507362. 91

M. Stonebraker. 2003. Visionary: A next generation visualization system for databases. In Proc. ACM SIGMOD International Conference on Management of Data, p. 635. http://www.acm.org/sigmod/sigmod03/eproceedings/papers/ind00.pdf.

M. Stonebraker. 2004. Outrageous ideas and/or thoughts while shaving. In Proc. 20th International Conference on Data Engineering, p. 869. DOI: 10.1109/ICDE.2004.1320096.

M. Stonebraker. 2008a. Why did Jim Gray win the Turing Award? ACM SIGMOD Record, 37(2): 33–34. DOI: 10.1145/1379387.1379398.

M. Stonebraker. 2008b. Technical perspective—one size fits all: An idea whose time has come and gone. Communications of the ACM, 51(12): 76. DOI: 10.1145/1409360.1409379. 92

M. Stonebraker. 2009a. Stream processing. In L. Liu and M. T. Ozsu, editors. Encyclopedia of Database Systems, pp. 2837–2838. Springer. DOI: 10.1007/978-0-387-39940-9_371.

M. Stonebraker. 2009b. A new direction for TPC? In Proc. 1st TPC Technology Conference on Performance Evaluation and Benchmarking, pp. 11–17. DOI: 10.1007/978-3-642-10424-4_2.

M. Stonebraker. 2010a. SQL databases v. NoSQL databases. Communications of the ACM, 53(4): 10–11. DOI: 10.1145/1721654.1721659. 50

M. Stonebraker. 2010b. In search of database consistency. Communications of the ACM, 53(10): 8–9. DOI: 10.1145/1831407.1831411.

M. Stonebraker. 2011a. Stonebraker on data warehouses. Communications of the ACM, 54(5): 10–11. DOI: 10.1145/1941487.1941491.

M. Stonebraker. 2011b. Stonebraker on NoSQL and enterprises. Communications of the ACM, 54(8): 10–11. DOI: 10.1145/1978542.1978546. 50

M. Stonebraker. 2012a. SciDB: An open-source DBMS for scientific data. ERCIM News, 2012(89). http://ercim-news.ercim.eu/en89/special/scidb-an-open-source-dbms-for-scientific-data.

M. Stonebraker. 2012b. New opportunities for new SQL. Communications of the ACM, 55(11): 10–11. DOI: 10.1145/2366316.2366319.

M. Stonebraker. 2013. We are under attack; by the least publishable unit. In Proc. 6th Biennial Conference on Innovative Data Systems Research. http://www.cidrdb.org/cidr2013/Talks/CIDR13_Gongshow16.ppt. 273

M. Stonebraker. 2015a. Turing lecture. In Proc. Federated Computing Research Conference, p. 2. DOI: 10.1145/2820468.2820471.

M. Stonebraker. 2015b. What it’s like to win the Turing Award. Communications of the ACM, 58(11): 11. xxxi, xxxiii

M. Stonebraker. 2015c. The Case for Polystores. ACM SIGMOD Blog, http://wp.sigmod.org/?p=1629. 370, 371

M. Stonebraker. 2016. The land sharks are on the squawk box. Communications of the ACM, 59(2): 74–83. DOI: 10.1145/2869958. 50, 129, 139, 260, 319

M. Stonebraker. 2018. My top ten fears about the DBMS field. In Proc. 34th International Conference on Data Engineering, pp. 24–28.

M. Sullivan and M. Stonebraker. 1991. Using write protected data structures to improve software fault tolerance in highly available database management systems. In Proc. 17th International Conference on Very Large Data Bases, pp. 171–180. http://www.vldb.org/conf/1991/P171.pdf.

R. Taft, E. Mansour, M. Serafini, J. Duggan, A. J. Elmore, A. Aboulnaga, A. Pavlo, and M. Stonebraker. 2014a. E-store: Fine-grained elastic partitioning for distributed transaction processing. Proc. VLDB Endowment, 8(3): 245–256. http://www.vldb.org/pvldb/vol8/p245-taft.pdf. 188, 251

R. Taft, M. Vartak, N. R. Satish, N. Sundaram, S. Madden, and M. Stonebraker. 2014b. Genbase: a complex analytics genomics benchmark. In Proc. ACM SIGMOD International Conference on Management of Data, pp. 177–188. DOI: 10.1145/2588555.2595633.

R. Taft, W. Lang, J. Duggan, A. J. Elmore, M. Stonebraker, and D. J. DeWitt. 2016. Step: Scalable tenant placement for managing database-as-a-service deployments. In Proc. 7th ACM Symposium on Cloud Computing, pp. 388–400. DOI: 10.1145/2987550.2987575.

R. Taft, N. El-Sayed, M. Serafini, Y. Lu, A. Aboulnaga, M. Stonebraker, R. Mayerhofer, and F. Andrade. 2018. P-Store: an elastic database system with predictive provisioning. In Proc. ACM SIGMOD International Conference on Management of Data. 188

W. Tao, D. Deng, and M. Stonebraker. 2017. Approximate string joins with abbreviations. Proc. VLDB Endowment, 11(1): 53–65.

N. Tatbul, U. Çetintemel, S. B. Zdonik, M. Cherniack, and M. Stonebraker. 2003. Load shedding in a data stream manager. In Proc. 29th International Conference on Very Large Data Bases, pp. 309–320. http://www.vldb.org/conf/2003/papers/S10P03.pdf. 228, 229

N. Tatbul, S. Zdonik, J. Meehan, C. Aslantas, M. Stonebraker, K. Tufte, C. Giossi, and H. Quach. 2015. Handling shared, mutable state in stream processing with correctness guarantees. Quarterly Bulletin IEEE Technical Committee on Data Engineering, 38(4): 94–104. http://sites.computer.org/debull/A15dec/p94.pdf.

T. J. Teorey, J. W. DeHeus, R. Gerritsen, H. L. Morgan, J. F. Spitzer, and M. Stonebraker. 1976. SIGMOD (paper session). In Proc. 1976 ACM Annual Conference, p. 275. DOI: 10.1145/800191.805596.

M. S. Tuttle, S. H. Brown, K. E. Campbell, J. S. Carter, K. Keck, M. J. Lincoln, S. J. Nelson, and M. Stonebraker. 2001a. The semantic web as “perfection seeking”: A view from drug terminology. In Proc. 1st Semantic Web Working Symposium, pp. 5–16. http://www.semanticweb.org/SWWS/program/full/paper49.pdf.

M. S. Tuttle, S. H. Brown, K. E. Campbell, J. S. Carter, K. Keck, M. J. Lincoln, S. J. Nelson, and M. Stonebraker. 2001b. The semantic web as “perfection seeking”: A view from drug terminology. In I. F. Cruz, S. Decker, J. Euzenat, and D. L. McGuinness, editors, The Emerging Semantic Web, Selected Papers from the 1st Semantic Web Working Symposium, volume 75 of Frontiers in Artificial Intelligence and Applications. IOS Press.

J. Widom, A. Bosworth, B. Lindsey, M. Stonebraker, and D. Suciu. 2000. Of XML and databases (panel session): Where’s the beef? In Proc. ACM SIGMOD International Conference on Management of Data, p. 576. DOI: 10.1145/335191.335476.

M. W. Wilkins, R. Berlin, T. Payne, and G. Wiederhold. 1985. Relational and entity-relationship model databases and specialized design files in VLSI design. In Proc. 22nd ACM/IEEE Design Automation Conference, pp. 410–416.

J. Woodfill and M. Stonebraker. 1983. An implementation of hypothetical relations. In Proc. 9th International Conference on Very Large Data Bases, pp. 157–166. http://www.vldb.org/conf/1983/P157.pdf.

A. Woodruff and M. Stonebraker. 1995. Buffering of intermediate results in dataflow diagrams. In Proc. IEEE Symposium on Visual Languages, p. 187. DOI: 10.1109/VL.1995.520808.

A. Woodruff and M. Stonebraker. 1997. Supporting fine-grained data lineage in a database visualization environment. In Proc. 13th International Conference on Data Engineering, pp. 91–102. DOI: 10.1109/ICDE.1997.581742.

A. Woodruff, P. Wisnovsky, C. Taylor, M. Stonebraker, C. Paxson, J. Chen, and A. Aiken. 1994. Zooming and tunneling in Tioga: Supporting navigation in multimedia space. In Proc. IEEE Symposium on Visual Languages, pp. 191–193. DOI: 10.1109/VL.1994.363622.

A. Woodruff, A. Su, M. Stonebraker, C. Paxson, J. Chen, A. Aiken, P. Wisnovsky, and C. Taylor. 1995. Navigation and coordination primitives for multidimensional visual browsers. In Proc. IFIP WG 2.6 3rd Working Conference Visual Database Systems, pp. 360–371. DOI: 10.1007/978-0-387-34905-3_23.

A. Woodruff, J. A. Landay, and M. Stonebraker. 1998a. Goal-directed zoom. In CHI ’98 Conference Summary on Human Factors in Computing Systems, pp. 305–306. DOI: 10.1145/286498.286781.

A. Woodruff, J. A. Landay, and M. Stonebraker. 1998b. Constant density visualizations of non-uniform distributions of data. In Proc. 11th Annual ACM Symposium on User Interface Software and Technology, pp. 19–28. DOI: 10.1145/288392.288397.

A. Woodruff, J. A. Landay, and M. Stonebraker. 1998c. Constant information density in zoomable interfaces. In Proc. Working Conference on Advanced Visual Interfaces, pp. 57–65. DOI: 10.1145/948496.948505.

A. Woodruff, J. A. Landay, and M. Stonebraker. 1999. VIDA: (visual information density adjuster). In CHI ’99 Extended Abstracts on Human Factors in Computing Systems, pp. 19–20. DOI: 10.1145/632716.632730.

A. Woodruff, C. Olston, A. Aiken, M. Chu, V. Ercegovac, M. Lin, M. Spalding, and M. Stonebraker. 2001. Datasplash: A direct manipulation environment for programming semantic zoom visualizations of tabular data. Journal of Visual Languages and Computing, 12(5): 551–571. DOI: 10.1006/jvlc.2001.0219.

E. Wu, S. Madden, and M. Stonebraker. 2012. A demonstration of dbwipes: Clean as you query. Proc. VLDB Endowment, 5(12): 1894–1897. DOI: 10.14778/2367502.2367531.

E. Wu, S. Madden, and M. Stonebraker. 2013. Subzero: A fine-grained lineage system for scientific databases. In Proc. 29th International Conference on Data Engineering, pp. 865–876. DOI: 10.1109/ICDE.2013.6544881.

X. Yu, G. Bezerra, A. Pavlo, S. Devadas, and M. Stonebraker. 2014. Staring into the abyss: An evaluation of concurrency control with one thousand cores. Proc. VLDB Endowment, 8(3): 209–220. http://www.vldb.org/pvldb/vol8/p209-yu.pdf.

K. Yu, V. Gadepally, and M. Stonebraker. 2017. Database engine integration and performance analysis of the BigDAWG polystore system. High Performance Extreme Computing Conference (HPEC). IEEE, 2017. DOI: 10.1109/HPEC.2017.8091081. 376

S. B. Zdonik, M. Stonebraker, M. Cherniack, U. Çetintemel, M. Balazinska, and H. Balakrishnan. 2003. The aurora and medusa projects. Quarterly Bulletin IEEE Technical Committee on Data Engineering, 26(1): 3–10. http://sites.computer.org/debull/A03mar/zdonik.ps. 228, 324

References

D. Abadi, Y. Ahmad, M. Balazinska, U. Çetintemel, M. Cherniack, J.-H. Hwang, W. Lindner, A. Maskey, A. Rasin, E. Ryvkina, N. Tatbul, Y. Xing, and S. Zdonik. 2005. The design of the Borealis stream processing engine. Proc. of the 2nd Biennial Conference on Innovative Data Systems Research (CIDR’05), Asilomar, CA, January. 228

Z. Abedjan, L. Golab, and F. Naumann. August 2015. Profiling relational data: a survey. The VLDB Journal, 24(4): 557–581. DOI: 10.1007/s00778-015-0389-y. 297

ACM. 2015a. Announcement: Michael Stonebraker, Pioneer in Database Systems Architecture, Receives 2014 ACM Turing Award. http://amturing.acm.org/award_winners/stonebraker_1172121.cfm. Accessed February 5, 2018.

ACM. March 2015b. Press Release: MIT’s Stonebraker Brought Relational Database Systems from Concept to Commercial Success, Set the Research Agenda for the Multibillion-Dollar Database Field for Decades. http://sigmodrecord.org/publications/sigmodRecord/1503/pdfs/04_announcements_Stonebraker.pdf. Accessed February 5, 2018.

ACM. 2016. A.M. Turing Award Citation and Biography. http://amturing.acm.org/award_winners/stonebraker_1172121.cfm. Accessed September 24, 2018. xxxi

Y. Ahmad, B. Berg, U. Çetintemel, M. Humphrey, J. Hwang, A. Jhingran, A. Maskey, O. Papaemmanouil, A. Rasin, N. Tatbul, W. Xing, Y. Xing, and S. Zdonik. June 2005. Distributed operation in the Borealis Stream Processing Engine. Demonstration, ACM SIGMOD International Conference on Management of Data (SIGMOD’05). Baltimore, MD. Best Demonstration Award. 230, 325

M. M. Astrahan, M. W. Blasgen, D. D. Chamberlin, K. P. Eswaran, J. N. Gray, P. P. Griffiths, W. F. King, R. A. Lorie, P. R. McJones, J. W. Mehl, G. R. Putzolu, I. L. Traiger, B. W. Wade, and V. Watson. 1976. System R: relational approach to database management. ACM Transactions on Database Systems, 1(2): 97–137. DOI: 10.1145/320455.320457. 397

P. Bailis, E. Gan, S. Madden, D. Narayanan, K. Rong, and S. Suri. 2017. Macrobase: Prioritizing attention in fast data. Proc. of the 2017 ACM International Conference on Management of Data. ACM. DOI: 10.1145/3035918.3035928. 374

Berkeley Software Distribution. n.d. In Wikipedia. http://en.wikipedia.org/wiki/Berkeley_Software_Distribution. Last accessed March 1, 2018. 109

G. Beskales, I.F. Ilyas, L. Golab, and A. Galiullin. 2013. On the relative trust between inconsistent data and inaccurate constraints. Proc. of the IEEE International Conference on Data Engineering, ICDE 2013, pp. 541–552. Australia. DOI: 10.1109/ICDE.2013.6544854. 270

L. S. Blackford, J. Choi, A. Cleary, E. D’Azevedo, J. Demmel, I. Dhillon, J. Dongarra, S. Hammarling, G. Henry, A. Petitet, K. Stanley, D. Walker, R. C. Whaley. 2017. ScaLAPACK Users’ Guide. Society for Industrial and Applied Mathematics http://netlib.org/scalapack/slug/index.html. Last accessed December 31, 2017. 258

D. Bitton, D. J. DeWitt, and C. Turbyfill. 1983. Benchmarking database systems—a systematic approach. Computer Sciences Technical Report #526, University of Wisconsin. http://minds.wisconsin.edu/handle/1793/58490. 111

P. A. Boncz, M. L. Kersten, and S. Manegold. December 2008. Breaking the memory wall in MonetDB. Communications of the ACM, 51(12): 77–85. DOI: 10.1145/1409360.1409380. 151

M. L. Brodie. June 2015. Understanding data science: an emerging discipline for data-intensive discovery. In S. Cutt, editor, Getting Data Right: Tackling the Challenges of Big Data Volume and Variety. O’Reilly Media, Sebastopol, CA. 291

Brown University, Department of Computer Science. Fall 2002. Next generation stream-based applications. Conduit Magazine, 11(2). https://cs.brown.edu/about/conduit/conduit_v11n2.pdf. Last accessed May 14, 2018. 322

BSD licenses. n.d. In Wikipedia. http://en.wikipedia.org/wiki/BSD_licenses. Last accessed March 1, 2018. 109

M. Cafarella and C. Ré. April 2018. The last decade of database research and its blindingly bright future. or Database Research: A love song. DAWN Project, Stanford University. http://dawn.cs.stanford.edu/2018/04/11/db-community/. 6

M. J. Carey, D. J. DeWitt, M. J. Franklin, N. E. Hall, M. L. McAuliffe, J. F. Naughton, D. T. Schuh, M. H. Solomon, C. K. Tan, O. G. Tsatalos, S. J. White, and M. J. Zwilling. 1994. Shoring up persistent applications. In Proc. of the 1994 ACM SIGMOD International Conference on Management of Data (SIGMOD ’94), pp. 383–394. DOI: 10.1145/191839.191915. 152, 336

M. J. Carey, L. M. Haas, P. M. Schwarz, M. Arya, W. E. Cody, R. Fagin, M. Flickner, A. W. Luniewski, W. Niblack, and D. Petkovic. 1995. Towards heterogeneous multimedia information systems: The garlic approach. In Research Issues in Data Engineering, 1995: Distributed Object Management, Proceedings, pp. 124–131. IEEE. DOI: 10.1109/RIDE.1995.378736. 284

CERN. http://home.cern/about/computing. Last accessed December 31, 2017.

D. D. Chamberlin and R. F. Boyce. 1974. SEQUEL: A structured English query language. In Proc. of the 1974 ACM SIGFIDET (now SIGMOD) Workshop on Data Description, Access and Control (SIGFIDET ’74), pp. 249–264. ACM, New York. DOI: 10.1145/800296.811515. 404, 407

D. D. Chamberlin, M. M. Astrahan, K. P. Eswaran, P. P. Griffiths, R. A. Lorie, J. W. Mehl, P. Reisner, and B. W. Wade. 1976. SEQUEL 2: a unified approach to data definition, manipulation, and control. IBM Journal of Research and Development, 20(6): 560–575. DOI: 10.1147/rd.206.0560. 398

S. Chandrasekaran, O. Cooper, A. Deshpande, M. J. Franklin, J. M. Hellerstein, W. Hong, S. Krishnamurthy, S. Madden, V. Raman, F. Reiss, and M. Shah. 2003. TelegraphCQ: Continuous dataflow processing for an uncertain world. Proc. of the 2003 ACM SIGMOD International Conference on Management of Data (SIGMOD ’03), pp. 668–668. ACM, New York. DOI: 10.1145/872757.872857. 231

J. Chen, D. J. DeWitt, F. Tian, and Y. Wang. 2000. NiagaraCQ: A scalable continuous query system for Internet databases. Proc. of the 2000 ACM SIGMOD International Conference on Management of Data (SIGMOD ’00), pp. 379–390. ACM, New York. DOI: 10.1145/342009.335432. 231

M. Cherniack, H. Balakrishnan, M. Balazinska, D. Carney, U. Çetintemel, Y. Xing, and S. Zdonik. 2003. Scalable distributed stream processing. Proc. of the First Biennial Conference on Innovative Database Systems (CIDR’03), Asilomar, CA, January. 228

C. M. Christensen. 1997. The Innovator’s Dilemma: When New Technologies Cause Great Firms to Fail. Harvard Business School Press, Boston, MA. 100

X. Chu, I. F. Ilyas, and P. Papotti. 2013a. Holistic data cleaning: Putting violations into context. Proc. of the IEEE International Conference on Data Engineering, ICDE 2013, pp. 458–469. Australia. DOI: 10.1109/ICDE.2013.6544847. 270, 297

X. Chu, I. F. Ilyas, and P. Papotti. 2013b. Discovering denial constraints. Proc. of the VLDB Endowment, PVLDB 6(13): 1498–1509. DOI: 10.14778/2536258.2536262. 270

X. Chu, J. Morcos, I. F. Ilyas, M. Ouzzani, P. Papotti, N. Tang, and Y. Ye. 2015. Katara: A data cleaning system powered by knowledge bases and crowdsourcing. In Proc. of the 2015 ACM SIGMOD International Conference on Management of Data (SIGMOD ’15), pp. 1247–1261. ACM, New York. DOI: 10.1145/2723372.2749431. 297

P. J. A. Cock, C. J. Fields, N. Goto, M. L. Heuer, and P. M. Rice. 2009. The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants. Nucleic Acids Research 38.6: 1767–1771. DOI: 10.1093/nar/gkp1137. 374

E. F. Codd. June 1970. A relational model of data for large shared data banks. Communications of the ACM, 13(6): 377–387. DOI: 10.1145/362384.362685. 42, 98, 166, 397, 404, 405, 407

M. Collins. 2016. Thomson Reuters uses Tamr to deliver better connected content at a fraction of the time and cost of legacy approaches. Tamr blog, July 28. https://www.tamr.com/video/thomson-reuters-uses-tamr-deliver-better-connected-content-fraction-time-cost-legacy-approaches/. Last accessed January 24, 2018. 275

G. Copeland and D. Maier. 1984. Making smalltalk a database system. Proc. of the 1984 ACM SIGMOD International Conference on Management of Data (SIGMOD ’84), pp. 316–325. ACM, New York. DOI: 10.1145/602259.602300. 111

C. Cranor, T. Johnson, V. Shkapenyuk, and O. Spatscheck. 2003. Gigascope: A stream database for network applications. Proc. of the 2003 ACM SIGMOD International Conference on Management of Data (SIGMOD ’03), pp. 647–651. ACM, New York. DOI: 10.1145/872757.872838. 231

A. Crotty, A. Galakatos, K. Dursun, T. Kraska, U. Çetintemel, and S. Zdonik. 2015. Tupleware: “Big Data, Big Analytics, Small Clusters.” CIDR. DOI: 10.1.1.696.32. 374

M. Dallachiesa, A. Ebaid, A. Eldawi, A. Elmagarmid, I. F. Ilyas, M. Ouzzani, and N. Tang. 2013. NADEEF, a commodity data cleaning system. Proc. of the 2013 ACM SIGMOD Conference on Management of Data, pp. 541–552. New York. http://dx.doi.org/10.1145/2463676.2465327. 270, 297

T. Dasu and J. M. Loh. 2012. Statistical distortion: Consequences of data cleaning. PVLDB, 5(11): 1674–1683. DOI: 10.14778/2350229.2350279. 297

C. J. Date and E. F. Codd. 1975. The relational and network approaches: Comparison of the application programming interfaces. In Proc. of the 1974 ACM SIGFIDET (now SIGMOD) Workshop on Data Description, Access and Control: Data Models: Data-Structure-Set Versus Relational (SIGFIDET ’74), pp. 83–113. ACM, New York. DOI: 10.1145/800297.811534. 405

D. J. DeWitt. 1979a. DIRECT—a multiprocessor organization for supporting relational database management systems. IEEE Transactions on Computers, 28(6), 395–406. DOI: 10.1109/TC.1979.1675379. 109

D. J. DeWitt. 1979b. Query execution in DIRECT. In Proc. of the 1979 ACM SIGMOD International Conference on Management of Data (SIGMOD ’79), pp. 13–22. ACM, New York. DOI: 10.1145/582095.582098. 109

D. J. DeWitt, R. H. Gerber, G. Graefe, M. L. Heytens, K. B. Kumar, and M. Muralikrishna. 1986. GAMMA—a high performance dataflow database machine. Proc. of the 12th International Conference on Very Large Data Bases (VLDB ’86), W. W. Chu, G. Gardarin, S. Ohsuga, and Y. Kambayashi, editors, pp. 228–237. Morgan Kaufmann Publishers Inc., San Francisco, CA. 111

D.J. DeWitt, S. Ghandeharizadeh, D. A. Schneider, A. Bricker, H.-I. Hsiao, and R. Rasmussen. March 1990. The Gamma database machine project. IEEE Transactions on Knowledge and Data Engineering, 2(1): 44–62. DOI: 10.1109/69.50905. 151, 400

D. DeWitt and J. Gray. June 1992. Parallel database systems: the future of high performance database systems. Communications of the ACM, 35(6): 85–98. DOI: 10.1145/129888.129894. 199

D. J. DeWitt, A. Halverson, R. Nehme, S. Shankar, J. Aguilar-Saborit, A. Avanes, M. Flasza, and J. Gramling. 2013. Split query processing in polybase. Proc. of the 2013 ACM SIGMOD International Conference on Management of Data (SIGMOD ’13), pp. 1255–1266. ACM, New York. 284

C. Diaconu, C. Freedman, E. Ismert, P-A. Larson, P. Mittal, R. Stonecipher, N. Verma, and M. Zwilling. 2013. Hekaton: SQL server’s memory-optimized OLTP engine. In Proc. of the 2013 ACM SIGMOD International Conference on Management of Data (SIGMOD ’13), pp. 1243–1254. ACM, New York. DOI: 10.1145/2463676.2463710.

K. P. Eswaran, J. N. Gray, R. A. Lorie, and I. L. Traiger. November 1976. The notions of consistency and predicate locks in a database system. Communications of the ACM, 19(11): 624–633. DOI: 10.1145/360363.360369. 114

W. Fan, J. Li, S. Ma, N. Tang, and W. Yu. April 2012. Towards certain fixes with editing rules and master data. The VLDB Journal, 21(2): 213–238. DOI: 10.1007/s00778-011-0253-7. 297

D. Fogg. September 1982. Implementation of domain abstraction in the relational database system INGRES. Master of Science Report, Dept. of Electrical Engineering and Computer Sciences, University of California, Berkeley, CA. 201

T. Flory, A. Robbin, and M. David. May 1988. Creating SIPP longitudinal analysis files using a relational database management system. CDE Working Paper No. 88-32, Institute for Research on Poverty, University of Wisconsin-Madison, Madison, WI. 197

V. Gadepally, J. Kepner, W. Arcand, D. Bestor, B. Bergeron, C. Byun, L. Edwards, M. Hubbell, P. Michaleas, J. Mullen, A. Prout, A. Rosa, C. Yee, and A. Reuther. 2015. D4M: Bringing associative arrays to database engines. High Performance Extreme Computing Conference (HPEC). IEEE, 2015. DOI: 10.1109/HPEC.2015.7322472. 370

V. Gadepally, K. O’Brien, A. Dziedzic, A. Elmore, J. Kepner, S. Madden, T. Mattson, J. Rogers, Z. She, and M. Stonebraker. September 2017. BigDAWG Version 0.1. IEEE High Performance Extreme Computing Conference (HPEC). DOI: 10.1109/HPEC.2017.8091077. 288, 369

J. Gantz and D. Reinsel. 2013. The Digital Universe in 2020: Big Data, Bigger Digital Shadows, and Biggest Growth in the Far East—United States, IDC, February. 5

L. Gerhardt, C. H. Faham, and Y. Yao. 2015. Accelerating scientific analysis with SciDB. Journal of Physics: Conference Series, 664(7). 268

B. Grad. 2007. Oral history of Michael Stonebraker, Transcription. Recorded: August 23, 2007. Computer History Museum, Moultonborough, NH. http://archive.computerhistory.org/resources/access/text/2012/12/102635858-05-01-acc.pdf. Last accessed April 8, 2018. 42, 43, 44, 98

A. Guttman. 1984. R-trees: a dynamic index structure for spatial searching. In Proc. of the 1984 ACM SIGMOD International Conference on Management of Data (SIGMOD ’84), pp. 47–57. ACM, New York. DOI: 10.1145/602259.602266. 205

L. M. Haas, J. C. Freytag, G. M. Lohman, and H. Pirahesh. 1989. Extensible query processing in starburst. In Proc. of the 1989 ACM SIGMOD International Conference on Management of Data (SIGMOD ’89), pp. 377–388. ACM, New York. DOI: 10.1145/67544.66962. 399

D. Halperin, V. Teixeira de Almeida, L. L. Choo, S. Chu, P. Koutris, D. Moritz, J. Ortiz, V. Ruamviboonsuk, J. Wang, and A. Whitaker. 2014. Demonstration of the Myria big data management service. Proc. of the 2014 ACM SIGMOD International Conference on Management of Data (SIGMOD ’14), pp. 881–884. ACM, New York. DOI: 10.1145/2588555.2594530. 284, 370

B. Haynes, A. Cheung, and M. Balazinska. 2016. PipeGen: Data pipe generator for hybrid analytics. Proc. of the Seventh ACM Symposium on Cloud Computing (SoCC ’16), M. K. Aguilera, B. Cooper, and Y. Diao, editors, pp. 470–483. ACM, New York. DOI: 10.1145/2987550.2987567. 287

M. A. Hearst. 2009. Search User Interfaces. Cambridge University Press, New York. 394

J. M. Hellerstein, J. F. Naughton, and A. Pfeffer. 1995. Generalized search trees for database systems. In Proc. of the 21st International Conference on Very Large Data Bases (VLDB ’95), pp. 562–573. Morgan Kaufmann Publishers Inc., San Francisco, CA. http://dl.acm.org/citation.cfm?id=645921.673145. 210

J. M. Hellerstein, E. Koutsoupias, D. P. Miranker, C. H. Papadimitriou, V. Samoladas. 2002. On a model of indexability and its bounds for range queries, Journal of the ACM (JACM), 49.1: 35–55. DOI: 10.1145/505241.505244. 210

IBM. 1997. Special Issue on IBM’s S/390 Parallel Sysplex Cluster. IBM Systems Journal, 36(2). 400

S. Idreos, F. Groffen, N. Nes, S. Manegold, S. K. Mullender, and M. L. Kersten. 2012. MonetDB: two decades of research in column-oriented database architectures. IEEE Data Engineering Bulletin, 35(1): 40–45. 258

N. Jain, S. Mishra, A. Srinivasan, J. Gehrke, J. Widom, H. Balakrishnan, U. Çetintemel, M. Cherniack, R. Tibbetts, and S. Zdonik. 2008. Towards a streaming SQL standard. Proc. VLDB Endowment, pp. 1379–1390. August 1–2. DOI: 10.14778/1454159.1454179. 229

A. E. W. Johnson, T. J. Pollard, L. Shen, L. H. Lehman, M. Feng, M. Ghassemi, B. E. Moody, P. Szolovits, L. A. G. Celi, and R. G. Mark. 2016. MIMIC-III, a freely accessible critical care database. Scientific Data 3: 160035 DOI: 10.1038/sdata.2016.35. 370

V. Josifovski, P. Schwarz, L. Haas, and E. Lin. 2002. Garlic: a new flavor of federated query processing for DB2. In Proc. of the 2002 ACM SIGMOD International Conference on Management of Data (SIGMOD ’02), pp. 524–532. ACM, New York. DOI: 10.1145/564691.564751. 401

J. W. Josten, C. Mohan, I. Narang, and J. Z. Teng. 1997. DB2’s use of the coupling facility for data sharing. IBM Systems Journal, 36(2): 327–351. DOI: 10.1147/sj.362.0327. 400

S. Kandel, A. Paepcke, J. Hellerstein, and J. Heer. 2011. Wrangler: Interactive visual specification of data transformation scripts. In Proc. of the SIGCHI Conference on Human Factors in Computing Systems (CHI ’11), pp. 3363–3372. ACM, New York. DOI: 10.1145/1978942.1979444. 297

R. Katz. editor. June 1982. Special issue on design data management. IEEE Database Engineering Newsletter, 5(2). 200

J. Kepner, V. Gadepally, D. Hutchison, H. Jensen, T. Mattson, S. Samsi, and A. Reuther. 2016. Associative array model of SQL, NoSQL, and NewSQL Databases. IEEE High Performance Extreme Computing Conference (HPEC) 2016, Waltham, MA, September 13–15. DOI: 10.1109/HPEC.2016.7761647. 289

V. Kevin M. Whitney. 1974. Relational data management implementation techniques. In Proc. of the 1974 ACM SIGFIDET (now SIGMOD) Workshop on Data Description, Access and Control (SIGFIDET ’74), pp. 321–350. ACM, New York. DOI: 10.1145/800296.811519. 404

Z. Khayyat, I.F. Ilyas, A. Jindal, S. Madden, M. Ouzzani, P. Papotti, J.-A. Quiané-Ruiz, N. Tang, and S. Yin. 2015. Bigdansing: A system for big data cleansing. In Proc. of the 2015 ACM SIGMOD International Conference on Management of Data (SIGMOD ’15), pp. 1215–1230. ACM, New York. DOI: 10.1145/2723372.2747646. 297

R. Kimball and M. Ross. 2013. The Data Warehouse Toolkit. John Wiley & Sons, Inc. https://www.kimballgroup.com/data-warehouse-business-intelligence-resources/books/. Last accessed March 2, 2018. 337

M. Kornacker, C. Mohan, and J.M. Hellerstein. 1997. Concurrency and recovery in generalized search trees. In Proc. of the 1997 ACM SIGMOD International Conference on Management of Data (SIGMOD ’97), pp. 62–72. ACM, New York. DOI: 10.1145/253260.253272. 210

A. Lamb, M. Fuller, R. Varadarajan, N. Tran, B. Vandiver, L. Doshi, and C. Bear. August 2012. The Vertica Analytic Database: C-Store 7 years later. Proc. VLDB Endowment, 5(12): 1790–1801. DOI: 10.14778/2367502.2367518. 333, 336

L. Lamport. 2001. Paxos Made Simple. http://lamport.azurewebsites.net/pubs/paxos-simple.pdf. Last accessed December 31, 2017. 258

D. Laney. 2001. 3D data management: controlling data volume, variety and velocity. META Group Research, February 6. https://blogs.gartner.com/doug-laney/files/2012/01/ad949-3D-Data-Management-Controlling-Data-Volume-Velocity-and-Variety.pdf. Last accessed April 22, 2018. 357

P-A. Larson, C. Clinciu, E.N. Hanson, A. Oks, S.L. Price, S. Rangarajan, A. Surna, and Q. Zhou. 2011. SQL server column store indexes. In Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data (SIGMOD ’11), pp. 1177–1184. ACM, New York. DOI: 10.1145/1989323.1989448.

J. LeFevre, J. Sankaranarayanan, H. Hacigumus, J. Tatemura, N. Polyzotis, and M. J. Carey. 2014. MISO: Souping up big data query processing with a multistore system. Proc. of the 2014 ACM SIGMOD International Conference on Management of Data (SIGMOD ’14), pp. 1591–1602. ACM, New York. DOI: 10.1145/2588555.2588568. 284

B. G. Lindsay. 1987. A retrospective of R*: a distributed database management system. In Proc. of the IEEE, 75(5): 668–673. DOI: 10.1109/PROC.1987.13780. 400

B. Liskov and S.N. Zilles. 1974. Programming with abstract data types. SIGPLAN Notices, 9(4): 50–59. DOI: 10.1145/942572.807045. 88

S. Marcin and A. Csillaghy. 2016. Running scientific algorithms as array database operators: Bringing the processing power to the data. 2016 IEEE International Conference on Big Data. pp. 3187–3193. DOI: 10.1109/BigData.2016.7840974. 350

T. Mattson, V. Gadepally, Z. She, A. Dziedzic, and J. Parkhurst. 2017. Demonstrating the BigDAWG polystore system for ocean metagenomic analysis. CIDR’17 Chaminade, CA. http://cidrdb.org/cidr2017/papers/p120-mattson-cidr17.pdf. 288, 374

J. Meehan, C. Aslantas, S. Zdonik, N. Tatbul, and J. Du. 2017. Data ingestion for the connected world. Conference on Innovative Data Systems Research (CIDR’17), Chaminade, CA, January. 376

A. Metaxides, W. B. Helgeson, R. E. Seth, G. C. Bryson, M. A. Coane, D. G. Dodd, C. P. Earnest, R. W. Engles, L. N. Harper, P. A. Hartley, D. J. Hopkin, J. D. Joyce, S. C. Knapp, J. R. Lucking, J. M. Muro, M. P. Persily, M. A. Ramm, J. F. Russell, R. F. Schubert, J. R. Sidlo, M. M. Smith, and G. T. Werner. April 1971. Data Base Task Group Report to the CODASYL Programming Language Committee. ACM, New York. 43

C. Mohan, D. Haderle, B. Lindsay, H. Pirahesh, and P. Schwarz. 1992. ARIES: a transaction recovery method supporting fine-granularity locking and partial rollbacks using write-ahead logging. ACM Transactions on Database Systems, 17(1), 94–162. DOI: 10.1145/128765.128770. 402

R. Motwani, J. Widom, A. Arasu B. Babcock, S. Babu, M. Datar, G. Manku, C. Olston, J. Rosenstein, and R. Varma. 2003. Query processing, approximation, and resource management in a data stream management system. Proc. of the First Biennial Conference on Innovative Data Systems Research (CIDR), January. 229, 231

A. Oloso, K-S Kuo, T. Clune, P. Brown, A. Poliakov, H. Yu. 2016. Implementing connected component labeling as a user defined operator for SciDB. Proc. of 2016 IEEE International Conference on Big Data (Big Data). Washington, DC. DOI: 10.1109/BigData.2016.7840945. 263, 350

M. A. Olson. 1993. The design and implementation of the inversion file system. USENIX Winter. http://www.usenix.org/conference/usenix-winter-1993-conference/presentation/design-and-implementation-inversion-file-syste. Last accessed January 22, 2018. 215

J. C. Ong. 1982. Implementation of abstract data types in the relational database system INGRES. Master of Science Report, Dept. of Electrical Engineering and Computer Sciences, University of California, Berkeley, CA, September 1982. 201

A. Palmer. 2013. Culture matters: Facebook CIO talks about how well Vertica, Facebook people mesh. Koa Labs Blog, December 20. http://koablog.wordpress.com/2013/12/20/culture-matters-facebook-cio-talks-about-how-well-vertica-facebook-people-mesh. Last accessed March 14, 2018. 132, 133

A. Palmer. 2015a. The simple truth: happy people, healthy company. Tamr Blog, March 23. http://www.tamr.com/the-simple-truth-happy-people-healthy-company/. Last accessed March 14, 2018. 138

A. Palmer. 2015b. Where the red book meets the unicorn, Xconomy, June 22. http://www.xconomy.com/boston/2015/06/22/where-the-red-book-meets-the-unicorn/ Last accessed March 14, 2018. 130

A. Pavlo and M. Aslett. September 2016. What’s really new with NewSQL? ACM SIGMOD Record, 45(2): 45–55. DOI: 10.1145/3003665.3003674. 246

G. Press. 2016. Cleaning big data: most time-consuming, least enjoyable data science task, survey says. Forbes, May 23. https://www.forbes.com/sites/gilpress/2016/03/23/data-preparation-most-time-consuming-least-enjoyable-data-science-task-survey-says/#79e14e326f63. 357

N. Prokoshyna, J. Szlichta, F. Chiang, R. J. Miller, and D. Srivastava. 2015. Combining quantitative and logical data cleaning. PVLDB, 9(4): 300–311. DOI: 10.14778/2856318.2856325. 297

E. Ryvkina, A. S. Maskey, M. Cherniack, and S. Zdonik. 2006. Revision processing in a stream processing engine: a high-level design. Proc. of the 22nd International Conference on Data Engineering (ICDE’06), pp. 141–. Atlanta, GA, April. IEEE Computer Society, Washington, DC. DOI: 10.1109/ICDE.2006.130. 228

C. Saracco and D. Haderle. 2013. The history and growth of IBM’s DB2. IEEE Annals of the History of Computing, 35(2): 54–66. DOI: 10.1109/MAHC.2012.55. 398

N. Savage. May 2015. Forging relationships. Communications of the ACM, 58(6): 22–23. DOI: 10.1145/2754956.

M. C. Schatz and B. Langmead. 2013. The DNA data deluge. IEEE Spectrum Magazine. https://spectrum.ieee.org/biomedical/devices/the-dna-data-deluge. 354

Z. She, S. Ravishankar, and J. Duggan. 2016. BigDAWG polystore query optimization through semantic equivalences. High Performance Extreme Computing Conference (HPEC). IEEE, 2016. DOI: 10.1109/HPEC.2016.7761584. 373

SIGFIDET panel discussion. 1974. In Proc. of the 1974 ACM SIGFIDET (now SIGMOD) Workshop on Data Description, Access and Control: Data Models: Data-Structure-Set Versus Relational (SIGFIDET ’74), pp. 121–144. ACM, New York. DOI: 10.1145/800297.811534. 404

R. Snodgrass. December 1982. Monitoring distributed systems: a relational approach. Ph.D. Dissertation, Computer Science Department, Carnegie Mellon University, Pittsburgh, PA. 197

A. Szalay. June 2008. The Sloan digital sky survey and beyond. ACM SIGMOD Record, 37(2): 61–66. 255

Tamr. 2017. Tamr awarded patent for enterprise-scale data unification system. Tamr blog. February 9 2017. https://www.tamr.com/tamr-awarded-patent-enterprise-scale-data-unification-system-2/. Last accessed January 24, 2018. 275

R. Tan, R. Chirkova, V. Gadepally, and T. Mattson. 2017. Enabling query processing across heterogeneous data models: A survey. IEEE Big Data Workshop: Methods to Manage Heterogeneous Big Data and Polystore Databases, Boston, MA. DOI: 10.1109/BigData.2017.8258302. 284, 376

N. Tatbul and S. Zdonik. 2006. Window-aware Load Shedding for Aggregation Queries over Data Streams. In Proc. of the 32nd International Conference on Very Large Databases (VLDB’06), Seoul, Korea. 228, 229

N. Tatbul, U. Çetintemel, and S. Zdonik. 2007. “Staying FIT: Efficient Load Shedding Techniques for Distributed Stream Processing.” International Conference on Very Large Data Bases (VLDB’07), Vienna, Austria. 228, 229

R. P. van de Riet. 1986. Expert database systems. In Future Generation Computer Systems, 2(3): 191–199, DOI: 10.1016/0167-739X(86)90015-4. 407

M. Vartak, S. Rahman, S. Madden, A. Parameswaran, and N. Polyzotis. September 2015. Seedb: Efficient data-driven visualization recommendations to support visual analytics. PVLDB, 8(13): 2182–2193. DOI: 10.14778/2831360.2831371. 297

B. Wallace. June 9, 1986. Data base tool links to remote sites. Network World. http://books.google.com/books?id=aBwEAAAAMBAJ&pg=PA49&lpg=PA49&dq=ingres+star&source=bl&ots=FSMIR4thMj&sig=S1fzaaOT5CHRq4cwbLFEQp4UYCs&hl=en&sa=X&ved=0ahUKEwjJ1J_NttvZAhUG82MKHco2CfAQ6AEIYzAP#v=onepage&q=ingres%20star&f=false. Last accessed March 14, 2018. 305

J. Wang and N. J. Tang. 2014. Towards dependable data repairing with fixing rules. In Proc. of the 2014 ACM SIGMOD International Conference on Management of Data (SIGMOD ’14), pp. 457–468. ACM, New York. DOI: 10.1145/2588555.2610494. 297

E. Wong and K. Youssefi. September 1976. Decomposition—a strategy for query processing. ACM Transactions on Database Systems, 1(3): 223–241. DOI: 10.1145/320473.320479. 196

E. Wu and S. Madden. 2013. Scorpion: Explaining away outliers in aggregate queries. PVLDB, 6(8): 553–564. DOI: 10.14778/2536354.2536356. 297

Y. Xing, S. Zdonik, and J.-H. Hwang. April 2005. Dynamic load distribution in the Borealis Stream Processor. Proc. of the 21st International Conference on Data Engineering (ICDE’05), Tokyo, Japan. DOI: 10.1109/ICDE.2005.53. 228, 230, 325

Index

The index that appeared in the print version of this title was intentionally removed from the eBook. Please use the search function on your eReading device to search for terms of interest. For your reference, the terms that appear in the print index are listed below.

50-year perspective of Stonebraker

1970 fall, University of Michigan

1976 fall, Wisconsin

1983 fall, Berkeley

1988–1995

2000, Project Sequoia

2003, CIDR Conference launch

2005, MIT sabbatical

2008, MapReduce

2014, Turing Award

2016, MIT

2017, encounter

1000 Genomes Browser

1 Million Veterans program

Abadi, Daniel J.

C-Store project perspective article

end of architectural era

H-Store prototype

OLTP databases

Vertica Systems

Abstract data types (ADTs)

Ingres

Ingres prototype

Postgres

“Access Control in a Relational Database Management System By Query Modification” (Stonebraker and Wong)

Access methods

Ingres

Postgres

Access Methods Interface (AMI)

ACM Software System Award

Actian enterprise

Active databases in Postgres

Active-passive replication in OLTP

Addamark system

Address space limitations in Ingres

Administration files in Ingres

ADMINS system

ADT-Ingres

Adult supervision for startup companies

Advanced H-Store strategy

Affero GPL license

Aggregation operators in C-Store

Aggregation systems in one size fits all

AI systems

Ingres-oriented

machine learning

Algorithmic complexity in Tamr

Allman, Eric

Allocator fragmentation in VoltDB

AllofUs program

Anchor tuples in H-Store

Anchored projections in C-Store

Anton, Jeff

Miro team

Postgres productionization

AntsDBMS

Anurag, Maskey, on Aurora project

Aoki, Paul

Apache Hadoop project

criticism of

open source impact

Apache HAWQ

Apache Spark

Application logic in one size fits all

Area greater than (AGT) operator in Postgres

ARIES (Algorithms for Recovery and isolation Exploiting Semantics)

Arizona State University

Array Functional Language (AFL) for SciDB

Arrays

Postgres

SciDB

AS/400 platform

Aster Data

founding

Postgres parallelization

Audio data management

Aurora codelines and stream processing systems

Aurora project

founding

origins

research story

StreamBase based on

systems

Aurum system

Data Civilizer

description

AUX directory in Ingres

Availability

OLTP design

one size fits all

AVL trees

AWS Redshift

B-trees and B-tree indexes

C-Store

commercial Ingres codeline

description

OLTP

and Postgres

“B-Trees Re-examined” (Stonebraker and Held)

Bachman, Charles

relational-CODASYL debate

Turing Award

Bailis, Peter

Balakrishnan, Hari

C-Store project

StreamBase

Balazinska, Magdalena

Aurora/Borealis/StreamBase reunion

Borealis project

stream processing era article

Bates-Haus, Nikolaus, Tamr codeline article

Batkin, Adam, C-Store seminal work

Battle, Leilani

Bear, Chuck

Beaumont, Chris

Begoli, Edmon

Berkeley Municipal Court system

Berkeley Software Distribution (BSD) license

and Ingres

origins

Berkeley years

1983 fall

technical contributions

BerkeleyDB in StreamBase

Berman, Richard

Bernstein, Philip A.

on leadership and advocacy

relational database Ph.D

Berube, Paul

Beskales, George

Data Tamer project

Tamr co-founder

Tamr company

Betts, Ryan

BFGoodrich company

Bicycle story

Anacortes

Battle Lake

Carrington

difficult part

Drake

Ellicottville

Luddington

Marias Pass

as metaphor for building system software

Moultonborough

Sutton

Troy

Winthrop

Wollaston Beach

Big Data era

characteristics

and Postgres

stream processing in

volume, velocity, and variety

BigDAWG codeline

future

introduction

milestones

origins

public demonstration

refining

release

BigDAWG polystore system

conclusion

data movement

description

development

ISTC

one size does not fit all

origins

perspective on

query modeling and optimization

releases and demos

Biller, Steve

BIN directory in Ingres

Biobank program

Bioinformatics market

Bitstrings in C-Store

Blob storage in VoltDB

Blocking

Bochkov, Dmitry

Bohrbugs

Borealis codelines for stream processing systems

Borealis project

Aurora project move to

origins

systems

Bottleneck studies

Bowes, Ari, on Miro team

Boyce, Bruce

Brooks, Fred

Brown, Paul

quad chart

SciDB codeline

scientific data management article

Bruckner, Daniel

Data Tamer project

Tamr company

Buffer management

OLTP

Shore

Buffer Pool Manager in Shore

Bulk copy of data in QUEL

Business acumen on startup company teams

Business-to-consumer (B2C) space

Butterworth, Paul, commercial Ingres codeline article

C language in Postgres codelines

C-Store project

column-oriented database

COMMIT statements

concat operators

covering sets of projections

deletes

launch

one size doesn’t fit all era

performance

primary keys

prototype

Vertica Systems based on

C-Store project perspective

Abadi and computer science

building

idea, evolution, and impact

Vertica Systems founding

C-Store seminal work

abstract

conclusion

data model

introduction

performance

query execution

query optimization

related work

RS column store

storage management

tuple movers

updates and transactions

WS column store

Cache-conscious B-trees in OLTP

Caches in rules systems

Cafarella, Michael

Caltech Mead-Conway VLSI design era

Career flowchart

Carey, Michael J.

data storage capabilities

Ingres later years article

Ingres project contributions

Carnes, Donna, on Miro team

Carney, Don

Aurora project

StreamBase Systems

Carter, Fred, commercial Ingres codeline article

CASSM project

Catalog relations in Ingres

Çetintemel, Ugur

Aurora/Borealis/StreamBase reunion

Aurora project

one size does not fit all

one size fits all seminal work

StreamBase Systems

Chamberlin, Don

IBM Database

relational-CODASYL debate

XQuery language

Chen, Jolly

Postgres conversion

Postgres parser

PostgreSQL

SQLization project

Chen, Peinan

Cherniack, Mitch

Aurora/Borealis/StreamBase reunion

Aurora project

C-Store seminal work

expert sourcing

StreamBase Systems

Tamr project

Vertica Systems

Chicken Test in Postgres productionization

Chisholm, Sally (Penny)

Chisholm Laboratory data for BigDAWG polystore system

Christiansen, Clayton

CitusDB

Classes, transaction

Climate change and Project Sequoia

CLOS (Common LISP Object System)

CLOSER function in Ingres

Cloud Spanner

Cloudera

and MapReduce

open source impact

Cluster computing in OLTP

CODASYL (Conference on Data Systems Languages)

Codd report

database standard proposal

Codd, Edgar (Ted)

Ingres development

Ingres inspired by

Ingres platform

matrix algebra

relational-CODASYL debate

scientific data management

SCM SIGFIDET conference

Stonebraker influenced by

Turing Award

Cohera Corporation

Collected works of Stonebraker

Column-oriented database

Column store architecture

Comeau, Al

Commercial Ingres codeline

conclusion

open source Ingres

product production

research to commercial efforts

storage structures

user-defined types

Commercialization

C-Store project

impact for

Postgres

Commuting members

Compaction in VoltDB

Companies founded by Stonebraker

Company control in startup company guidelines

Compatibility problem in one size fits all

Complex objects

Ingres

Postgres

Complexity

avoiding

rules systems

Vertica Systems

Compression methods in C-Store project

Computer Information and Control Engineering (CICE) program

Computer science degrees

need for

University of Michigan

Computer Systems Research Group (CSRG)

Concepts limitation in Postgres

Concurrency control

C-Store

Distributed Ingres

H-Store

Ingres

OLTP

Postgres

CondorDB

Conference on Innovative Data Systems Research (CIDR)

BigDAWG polystore system

creation

launch

success of

Connection points in Aurora

Consistency

OLTP

QUEL

Constrained tree applications (CTAs)

Constrained tree schema in H-Store project

Constructed types in Postgres

Control flow in Ingres

Convey, Christian, on Aurora project

Copyright for Ingres project

CORBA

Correct primitives in one size fits all

Cost problem in one size fits all

CREATE command in Ingres

Credit, assigning

Cueball (partner)

Current epochs in C-Store

Customers

forgotten

startup companies

StreamBase

Vertica Systems

Cycles in OLTP

Data blades in Illustra

Data Civilizer

conclusion

data cleaning challenge

data discovery challenge

data transformation challenge

description

design

life of analysts

mung work automation

need for

purpose

Data cleaning challenge in Data Civilizer

Data discovery

Aurora project

Data Civilizer

Data-driven discovery paradigm

Data ingest rates DBMS limitation

Data Language/ALPHA

Data Management Technology Kairometer

Data movement in BigDAWG polystore system

Data structures in Ingres

Data Tamer project

creation

customers

description

idea and prototype

ideas source

lessons

research contributions

startup

Tamr company

Data transformation challenge in Data Civilizer

Data unification at scale. See Data Tamer project

Data warehouses

ideas source

multiple data copies

one size fits all

rise of

schemas

Database designer in H-Store

Database Management Systems (DBMSs) description

Databases, brief history of

DataBlades

DataBlitz system

DATADIR directory in Ingres

Datalog rule systems

Date, Chris

referential integrity paper

SCM SIGFIDET conference

David, Martin

DB2/400 system

Db2 for i system

Db2 for VSE&VM

Db2 for z/OS system

DB2/MVS system

DBA relations in Ingres

DBC project

Deadlocks in Ingres

Declarative query language, Ingres as

DECOMP program

Decompress operators in C-Store

Deep storage technologies in Postgres

Deferred update and recovery in Ingres

DELETE function in Ingres

Deleted record vectors (DRVs) in C-Store

Densepack values in storage

Depth of transaction classes

“Design and Implementation of Ingres” (Stonebraker)

Design problems in Postgres

DeWitt, David J.

50-year perspective article

CIDR

CondorDB version

Gamma

H-Store project

Hadoop criticism

one size doesn’t fit all era

publications

Shore prototype

Vertica Systems

DIRECT project

Disk orientation DBMS limitation

Disk persistence in VoltDB

Distinct values in C-Store

Distributed COMMIT processing in C-Store

Distributed computing

Distributed databases

Distributed Ingres

Docker tool for BigDAWG

Document data management in Ingres

Dozier, Jeff

Du, Jiang

Dynamic loading in Postgres

Dynamic locking in H-Store

Dziedzic, Adam

Early years and education

Elephants

Ellicott, Andy

Ellison, Larry

Oracle claims

SQL language support

Elmore, Aaron J., BigDAWG polystore system article

Emberson, Richard, on Miro team

EMP1 (friend)

Encoding schemes in C-Store

“End of an Architectural Era: It’s Time for a Complete Rewrite” paper (Stonebraker)

End of architectural era seminal work

abstract

H-Store

introduction

OLTP design considerations

one size does not fit all comments

performance

summary and future work

transaction, processing and environment assumptions

End of epochs in C-Store

End-to-end system Data Civilizer design

EnterpriseDB

Entrepreneur-Turned-Shark (friend)

Epochs in C-Store

Epstein, Bob

BSD license

Distributed Ingres

Ingres source code

stored procedures

venture capitalist contact

EQUEL language for Ingres

comments

invocation from

overview

Erwin, Christina, on Aurora project

ETL toolkit

Exceptions in rules systems

Excessive formalism

Expanding fields, failure to cope with

Expansive execution in BigDAWG polystore system

Experimental results for OLTP

Expert database systems, Ingres-oriented

Expert sourcing

Explicit parallelism in Tamr

Extremely Large Data Bases (XLDB) conference and workshop

Fabry, Bob

单一因素适用于所有情况

Factoring in one size fits all

故障转移

Failover

联机事务处理设计

OLTP design

一种尺寸适合所有人

one size fits all

失败

Failures

结果

consequences

拓展领域

expanding fields

被遗忘的顾客

forgotten customers

纸张泛滥

paper deluge

概括

summary

Postgres 中的快速路径功能

Fast path feature in Postgres

联邦架构

Federation architecture

女博士毕业

Female Ph.D.s graduated

费尔南德斯、劳尔·卡斯特罗

Fernandez, Raul Castro

Aurum研究故事

Aurum research story

数据文明者

Data Civilizer

数据文明者文章

Data Civilizer article

节日文集

Festschrift

文件

Files

安格尔

Ingres

UNIX环境

UNIX environment

一刀切的金融饲料处理

Financial-feed processing in one size fits all

Ingres 中的 FIND 函数

FIND function in Ingres

初创公司的首批客户

First customers for startup companies

第一届国际专家数据库系统会议

First International Conference on Expert Database Systems

丹尼斯·福格

Fogg, Dennis

福特、吉姆

Ford, Jim

便利店中的外键

Foreign keys in C-Store

便利店外国订单金额

Foreign-order values in C-Store

被遗忘的顾客

Forgotten customers

OLTP 设计的叉车式升级

Fork-lift upgrades in OLTP design

马克·福尼尔

Fournier, Marc

富兰克林,迈克尔·J.

Franklin, Michael J.

论文被拒绝

papers rejected by

电报队

Telegraph Team

吉姆·弗鲁

Frew, Jim

克里斯托夫·弗雷塔格

Freytag, Christoph

功能理念

Functional ideas

功能

Functions

Postgres

Postgres

后传

POSTQUEL

维杰·加德帕利

Gadepally, Vijay

BigDAWG 代码线文章

BigDAWG codeline article

BigDAWG 发布

BigDAWG releases

埃迪·加尔维斯

Galvez, Eddie

极光项目

Aurora project

流库系统

StreamBase Systems

伽玛计划

Gamma project

大蒜项目

Garlic project

盖茨、比尔

Gates, Bill

创业板语言

GEM language

广义搜索树 (GiST) 接口

Generalized Search Tree (GiST) interface

SciDB 的基因组数据

Genomic data for SciDB

地理信息系统(GIS)

Geographic Information Systems (GIS)

Ingres 中的 GET 函数

GET function in Ingres

千兆示波器项目

Gigascope project

全球生物样本库引擎

Global Biobank Engine

走吧,安吉拉

Go, Angela

虾虎鱼公司

Goby company

B2C空间

B2C space

数据驯服者

Data Tamer

启动

startup

谷歌技术

Google technologies

杰夫·戈斯

Goss, Jeff

Gosselin, Dave
Governor’s Academy
GPUs
Graduate students
Graphical user interfaces (GUIs) in prototypes
Grassy Brook company
founding
quad chart
Gray, Jim
2002 SIGMOD conference
CIDR
Project Sequoia
scientific data management
System R project
Tandem Computers
Turing Award
Great Relational-CODASYL Debate
Greenplum startup
Grid computing in OLTP design
Gupta, Ankush
Guttman, Antonin
Ingres CAD management features
R-Tree index structure
H-Store
basic strategy
buddies
conclusion
database designer
description
execution supervisor
founding
general transactions
ideas source
one size doesn’t fit all era
performance
prototypes
query execution
system architecture
transaction classes
transaction management, replication and recovery
VoltDB and PayPal
VoltDB based on
VoltDB executor in
VoltDB split
Hachem, Nabil
Data Civilizer
end of architectural era seminal work
Haderle, Don, recollections article
Hadoop
criticism of
open source impact
Hadoop Distributed File System (HDFS)
Hagen, Dale
Hamilton, James
on 2014 ACM Turing Award
IBM relational database code bases article
on server costs
Hammer, Joachim
Hanson, Eric
Harizopoulos, Stavros
end of architectural era seminal work
H-Store prototype
OLTP databases
Harris, Herschel
HASH structure in commercial Ingres codeline
Hatoun, Matt, on Aurora project
Hawthorn, Paula
Illustra
Miro team
Postgres productionization
Hearst, Marti, student perspective article
Heath, Bobbi
H-Store prototype
StreamBase Systems
Hedges, Richard
Heisenbugs
Held, Gerald
“B-Trees Re-examined”
Ingres implementation seminal work
relational database industry birth
Helland, Pat, end of architectural era seminal work
Hellerstein, Joseph M.
Data Tamer
Postgres codelines
Postgres description
Postgres perspective article
Postgres project
Tamr project
Telegraph Team
Heroku provider
High availability
OLTP
one size fits all
High water mark (HWM) in C-Store
Hill, Faith
Hints in Postgres
HiPac project
Hirohama, Michael, Postgres implementation seminal work
Historical mode queries
Hive executor
Hobbib, Bill
Hong, Wei
Illustra
Miro team
Postgres and Illustra codelines article
Postgres conversion
XPRS architecture
Horizontica version
Hot standby in OLTP design
Howe, Bill
HTCondor project
Hugg, John
H-Store prototype
VoltDB codeline article
Huras, Matt
Hwang, Jeong-Hyon, on Aurora project
Hypothetical relations in Ingres
IBM
IMS database
SQL language
IBM relational database code bases
four code bases
future
portable code base
IEEE International Conference on Data Engineering 2015 talk
Illustra codelines
overview
Postgres and SQL
Postgres productionization
Illustra Information Technologies, Inc.
open source impact
Oracle competitor
Postgres
Postgres commercial adaptations
startup
Ilyas, Ihab
Data Tamer project article
Tamr co-founder
Implementation efficiency in rules systems
“Implementation of Integrity Constraints and Views By Query Modification” (Stonebraker)
“Implementation of Postgres” (Stonebraker)
Implementing rules
IMS database
In-QTel
Inbound vs. outbound processing in one size fits all
INDEX catalog in Ingres
Indexes
C-Store
commercial Ingres
Postgres
primary and secondary
VoltDB
Industrial Liaison Program (ILP)
Industry involvement
InfiniBand
Informix
Illustra integrated into
Illustra purchase
startups bought by
Informix Universal Server
Ingres implementation seminal work
Access Methods Interface
conclusion
concurrency control
data structures and access methods
deferred update and recovery
EQUEL
file structure
future extensions
introduction
invocation from EQUEL
invocation from UNIX
performance
Process
Process
Process
process structure
QUEL and utility commands
query modification
storage structures
system catalogs
UNIX environment
user feedback
Ingres later years
contributions
Distributed Ingres
relational DBMS
support domains
Ingres project
ATTRIBUTE catalog
Berkeley years
birth of
and BSD
commercial codeline. See Commercial Ingres codeline
competition
COPY command
copyright
decomposition of queries
distributed
ideas source
impact
leadership and advocacy
open source
platform
Postgres design helped by
process structure
target platform
team
timing
Wisconsin, fall 1976
Ingres Star
Inheritance in Postgres
Innovator’s Dilemma (Christensen)
INSERT function in Ingres
Insertion vectors (IVs) in C-Store
Inserts in C-Store
Instructions vs. cycles in OLTP
INTEGRITY catalog in Ingres
Integrity control in QUEL
Intel Science and Technology Center (ISTC) for Big Data
Intellectual property
Inter-snapshot log in VoltDB
Intermediate H-Store strategy
Inversion file system
Irrelevant theories
ISAM (indexed sequential access method)
Islands in BigDAWG polystore system
Join indexes in C-Store
Join operators in C-Store
Jones, Anita
Jones, Evan
Joy, Bill
JSON data model
K-safe systems
Katz, Randy
KDB system
Kelley, Gary
Kepner, Jeremy
Kerschberg, Larry
Kersten, Martin
Keyed storage structure in Ingres
Keys
C-Store
Postgres
Kimura, Hideaki
Kinchen, Jason, SciDB codeline article
KISS (Keep it Simple, Stupid) adage
Knowledge management in Postgres
Knudsen, Eliot
Kooi, Bob
Kraska, Tim
Kreps, Peter
Ingres implementation seminal work
Ingres team
Land Sharks
Langer, Robert
Language constructs as DBMS limitation
Language support in Postgres
Large objects in Postgres
Large Synoptic Survey Telescope
Latching
H-Store
OLTP
Shore
Latency in VoltDB
Lau, Edmond, C-Store seminal work
Lawande, Shilpa
Vertica Systems
Vertica Systems codeline article
Leadership, partnership approach to
Leadership and advocacy
advocacy
mechanisms
systems
Least Publishable Units (LPUs)
grain optimization
problems from
Leibensperger, Mike
Lew, Ivan
Lewis, Bryan
Licenses, BSD
Lighthouse customers for startup companies
Lightstone, Sam
Lindsay, Bruce
Lineage support in scientific data management
Linear Road benchmark
Liquidation preference in startup company guidelines
Liskov, Barbara
Lisman, John
LISP for Postgres
Liu, Jason, at Tamr
Lock Manager in Shore
Locking
C-Store
H-Store
Ingres
OLTP
performance evaluation of
Shore
Log-centric storage and recovery in Postgres
Log Manager in Shore
Logical data model in C-Store
Logless databases
Logos and T-shirts for projects
Logs and logging
H-Store
OLTP
for recovery purposes
redo
Shore
undo
VoltDB
Lohman, Guy
Lorie, Raymond
Low water mark (LWM) in C-Store
LSM-tree concept
Lucking, J. R.
MacAIMS Data Management System
MacDonald, Nancy
Machine learning
Madden, Samuel
BigDAWG
C-Store project
C-Store seminal work
end of architectural era seminal work
Festschrift
H-Store prototype
ISTC
OLTP databases seminal work
research contributions article
Vertica Systems
Madden, Samuel, on Stonebraker
academic career and birth of Ingres
advocacy
awards and honors
companies founded
early years and education
industry, MIT, and new millennium
legacy
personal life
post-Ingres years
synopsis
MADlib library
Mahony, Colin
Maier, David
Main memory
OLTP design
studies
“Making Smalltalk a Database System” (Copeland and Maier)
MapReduce
blog post
criticism of
and Postgres
Mariposa system
description
federation architecture
prototype
Mark, Roger
Marketing problem in one size fits all
MARS system
Mask operators in C-Store
Mattson, Tim, BigDAWG polystore system article
McCline, Matt
McKnight, Kathy
McPherson, John
McQueston, James
MDM (master data management)
Meehan, John
Memory
OLTP design
studies
Memory resident databases in OLTP
Merck databases
Meredith, Jeff
Illustra
Miro team
Postgres
Merge-out process
Message transport in one size fits all
Method and System for Large Scale Data Curation patent
MIMIC (Multiparameter Intelligent Monitoring in Intensive Care) dataset
Miro team
Miso system
“Mission to Planet Earth” (MTPE) effort
Mistakes in startup company guidelines
MIT
2005 sabbatical
2016
Aurora and StreamBase projects
Industrial Liaison Program
research contributions
MIT CSAIL
MODIFY command in OVQP
Mohan, C.
Mom (friend)
MonetDB project
Morgenthaler, Gary, on Miro team
Morpheus project
description
prototype
startup
Morris, Barry
Aurora/Borealis/StreamBase reunion
StreamBase Systems
Mucklo, Matthew
Muffin parallel databases
MUFFIN prototype
Multi-core support in OLTP
Multi-threading in OLTP design
Multilingual access in Postgres
Myria project
MySQL
Nakerud, Jon
NASA “Mission to Planet Earth” effort
National Science Foundation (NSF)
proposal success rate
RANN program
Naughton, Jeff
Naumann, Felix, RDBMS genealogy article
Navigational era
Naylor, Arch
Nested queries in Postgres
Netezza startup
Network Time Protocol (NTP) for VoltDB
New Order transactions in OLTP
NewSQL architecture
No-overwrite storage manager in Postgres
Non-volatile RAM
Nonkeyed storage structure in Ingres
Nonrepeatable errors
Normal functions in Postgres
NoSQL systems
Novartis Institute for Biomedical Research (NIBR)
Object identifiers (OIDs) in Postgres
Object Management Extension in commercial Ingres
Object management in Postgres implementation
“Object Management in Postgres Using Procedures” (Stonebraker)
Object-orientation in Postgres
Object-Oriented Databases (OODBs)
Object-Relational DBMSs: Tracking the Next Great Wave (Stonebraker and Brown)
Object-Relational model
O’Brien, Kyle
O’Connell, Claire
Olson, Mike
Illustra
Inversion file system
open source article
Postgres B-tree implementation
Postgres codelines
OLTP (Online Transaction Processing) applications in H-Store project
OLTP (Online Transaction Processing) databases seminal work
abstract
alternative DBMS architectures
cache-conscious B-trees
conclusion
concurrency control
contributions and paper organization
experimental results
future engines
instructions vs. cycles
introduction
multi-core support
New Order transactions
overheads
payment
performance study
related work
replication management
results
setup and measurement methodology
Shore
throughput
trends
weak consistency
OLTP (Online Transaction Processing) design considerations
grid computing and fork-lift upgrades
high availability
knobs
main memory
multi-threading and resource control
payment transactions
“OLTP: Through the Looking Glass” paper (Harizopoulos)
One-shot applications
One-shot transactions
One size does not fit all
BigDAWG polystore system
in end of architectural era seminal work
overview
research contributions
special-purpose database systems
One size fits all: An idea whose time has come and gone seminal work
abstract
conclusion
correct primitives
data warehouses
DBMS processing and application logic integration
factoring
financial-feed processing
high availability
inbound versus outbound processing
introduction
performance
scientific databases
sensor-based applications
sensor networks
stream processing
synchronization
text search
XML databases
One-variable detachment in Ingres
One-Variable Query Processor (OVQP) in Ingres
O’Neil, Pat, C-Store seminal work
Ong, James
Open source
BSD and Ingres
BSD license
Ingres impact
open source Ingres
post-Ingres
Postgres
PostgreSQL
research impact
OPENR function in Ingres
“Operating System Support for Data Management” (Stonebraker and Kumar)
Operators
C-Store queries
Postgres
scientific data management
Optimizations in Shore
OQL language
Oracle Corporation
competition with
performance claims
Postgres attack by
Tamr
Orca optimizer
OS/2 Database Manager
OS/2 system
Ousterhout, John
Ouzzani, Mourad
Data Civilizer
Data Civilizer article
“Over My Dead Body” issues in StreamBase
Overheads in OLTP
Pagan, Alexander
Data Tamer project
Tamr company
Palmer, Andy
2014 Turing Award Ceremony
“Cue Ball”
Data Tamer project
Festschrift
startup company article
Tamr CEO
Tamr company
Vertica Systems
Paper deluge
Paper requirements
ParAccel
Paradigm
Paradise project
Parallel databases ideas source
Parallel Data Warehouse project, Microsoft
Parallel DBMS in Postgres
Parallel Sysplex
Parallelism in Tamr
PARAMD function in Ingres
PARAMI function in Ingres
Parsers in Ingres
Partitions in C-Store
Partnerships
leadership approach
startup companies
Partridge, John
Aurora/Borealis/StreamBase reunion
Aurora language
connection points
RFID tagging
StreamBase customers
StreamBase founding
StreamBase issues
Past data access as DBMS limitation
Patents
Path expressions in Postgres
Patterson, Dave
Pavlo, Andrew
H-Store project
H-Store project article
VoltDB executor in H-Store
PayPal
Pearson Correlation Coefficient (PCC)
People for startup companies
Performance
BigDAWG polystore system
bottleneck studies
C-Store
Data Unification
H-Store
Ingres
locking methods
OLTP
one size fits all
Postgres
Permute operators in C-Store
Perna, Janet
Persistence
Postgres
VoltDB
Persistent CLOS
Persistent redo logs
Personal life of Stonebraker
Peterlee IS/1 System
Ph.D. paper requirements
Pipes
Pirahesh, Hamid
Pitch decks in startup company guidelines
Pivotal company
Plans for C-Store queries
Poliakov, Alex, SciDB codeline article
Polybase system
Polystores. See BigDAWG polystore system
Portable code base for IBM relational databases
Post-Ingres years
open source
overview
Postgres codelines
conclusion
PostgreSQL
prototype
Postgres design
base types
bicycle trip metaphor
conclusion
Illustra buyout
inheritance
Ingres help for
Internet takeoff
Land Sharks
marketing challenge
performance benchmark
speedbumps
start
Postgres implementation seminal work
abstract
conclusion
data model
data model and query language overview
data model critique
design problems
dynamic loading and process structure
fast path feature
implementation introduction
introduction
object-orientation
POSTQUEL query language
programming language
rules systems
status and performance
storage systems
Postgres perspective
active databases and rule systems
commercial adaptations
context
deep storage technologies
language support
lessons
log-centric storage and recovery
overview
software impact
XPRS architecture
Postgres project
abstract data types
description
ideas source
Illustra purchase
impact
POSTQUEL
productionization
satisfaction with
and SQL
start of
PostgreSQL
creation
impact
open source
software architecture
POSTQUEL query language
features
functions
Potamianos, Spyros
Pragmatism in startup companies
Pricing models
Primary-copy replication control
Primary indexes
Primitives in one size fits all
Princeton University
Probabilistic reasoning in scientific data management
Problems
ignoring
solving
Process 2 in Ingres
Process 3 in Ingres
Process 4 in Ingres
Process structure
Ingres
Postgres
Project operators in C-Store
Project Oxygen
Project Sequoia
2000
Postgres
Projections in C-Store
PROTECTION catalog in Ingres
Prototypes
ADT-Ingres
Data Tamer project
H-Store project
Mariposa
Morpheus
MUFFIN
noise in
Postgres
Shore
startup companies
Tamr project
PRS2 system
Punctuated Streams Team
Purify tool for Postgres productionization
Putzolu, Franco
PVLDB 2016 paper
Qatar Computing Research Institute (QCRI)
creation
Data Civilizer project
Data Tamer
Tamr project
Quel language
comments
complex objects
description
overview
and standardization
utility commands
Query classes in H-Store
Query decomposition in Ingres
Query execution
C-Store
H-Store
Query modeling and optimization in BigDAWG
Query modification in Ingres
Query optimization in C-Store
Query rewrite implementation in rules systems
Quiet (friend)
R-Tree index structure
Ingres
and Postgres
Radical simplicity for Postgres transactional storage
RAID storage architectures
Raising money for startup companies
RAP project
RARES project
Rasin, Alex
on Aurora project
RCA company
Ré, Chris
Read-optimized systems
Real-time requirements as DBMS limitation
Real-world impact in rules of thumb
Record deduplication in Data Tamer project
Recorded Future company
Recovery
C-Store
database logs for
H-Store
Ingres
Postgres
Red Brick Systems
Redo logs
H-Store
OLTP design
“Reduction of Large Scale Markov Models for Random Chains” dissertation (Stonebraker)
Referential integrity
Reformatting tuples in Ingres
Relational-CODASYL debate
Relational database industry birth
Ingres competition
Ingres team
Ingres timing
maturity stage
overview
Relational database management systems (RDBMS)
industry birth timeline
Ingres later years
Relational databases, brief history of
Relational era
Relational models in one size does not fit all world
Relations
Ingres
QUEL
Remote direct memory access (RDMA)
Rendezvous system
REPLACE command in Ingres
Replication management in OLTP
Research, open source impact on
Research Applied to the National Needs (RANN) program
Research contributions
2010s and beyond
Berkeley years
MIT
one size doesn’t fit all era
technical rules of engagement
Research story about Aurora project
Resident set size (RSS) in VoltDB
Resource control in OLTP design
RETRIEVE commands in Ingres
Reviews, unsatisfactory
RFID (radio frequency identification) tagging
Ries, Dan
Rivers, Jonathan
Robinson, John “JR”
Shared Nothing band
Tamr
Vertica Systems
Rock fetches
Rogers, Jennie, BigDAWG polystore system article
Rollbacks in C-Store
Roots in OLTP tables
Route 66 TV show
Row store architecture
Rowe, Larry
commercial Ingres codeline
Ingres founding
Postgres
Postgres implementation seminal work
RTI founding
RS column store in C-Store
RTI (Relational Technology, Inc.)
commercial version
founding
Ingres basis of
Rubenstein, Brad
Ruby-on-Rails system
Rules systems in Postgres
complexity
implementation efficiency
introduction
knowledge management
push for
second system
views
S-Store project
Sales in startup company guidelines
Sales problem in one size fits all
Salz, Jon
Sarawagi, Sunita
ScaLAPACK analytics
Scaling in Tamr
Schek, Hans
Schema mapping in Data Tamer project
Schieffer, Berni
Schlamb, Kelly
Schreiber, Mark
Data Civilizer
Tamr
Schultz, Hayden
Schuster, Stu
SciDB codeline
connectivity
features focus
genomic data
hard numbers
languages
security
SciDB project
contributions for
description
one size doesn’t fit all era
Scientific data management
beginning tasks
current users
first users
logistics
mountain representation
planning
Scientific databases in one size fits all
Scope in BigDAWG polystore system
SDTM (Study Data Tabulation Model)
Search in one size fits all
Search User Interfaces (Hearst)
Second System Effect
Secondary indexes
Secrecy in startup company guidelines
Security in SciDB
Segments in C-Store
Select operators in C-Store
Self-funded companies
Self-order values in C-Store
Selinger, Pat
Sensor-based applications in one size fits all
SEQUEL language
Service of Stonebraker
Shankland, Jim, on Miro team
Shared-nothing architecture in SciDB codeline
Shared Nothing band
Sharma, Kristi Sen, SciDB codeline article
She, Zuohao (Jack)
Shims in BigDAWG polystore system
Shore (Scalable Heterogeneous Object Repository)
architecture
prototype
removing components
Shore Storage Manager (SSM)
Short One (friend)
Sibley, Ed
SIGFIDET conference
Singer, Adam, on Aurora project
Single-partition transactions in H-Store project
Single-sited transactions
Single threading in OLTP
Skeen, Dale
Skok, David
Sleepycat startup
Sloan Digital Sky Survey (SDSS)
Slotted pages in Shore
SMALLTALK language
Smooth (friend)
Snapshot isolation
Snodgrass, Rick
Software impact in Postgres
Solicitation in startup company guidelines
Sort keys and operators in C-Store
Source code for Ingres project
SOURCE directory in Ingres
Space budget in C-Store
Spanner
Spark
Spending money guidelines for startup companies
Sprite distributed OS
SQL language
introduction
MapReduce
one size does not fit all world
and Postgres
vs. Quel
SQuAl system
Stable memory in Postgres
Stanford Linear Accelerator (SLAC) facility
Star schema in data warehousing
Starburst project
Startup companies, founded
Startup companies, guidelines
business acumen on team
company control
first customers
ideas
intellectual property
introduction
lighthouse customers
mistakes
pitch deck and VC solicitation
raising money
sales
secrecy
spending money
summary
teams and prototypes
venture capitalists
Startup companies, running
introduction
overview
partnerships
people in
pragmatism
State storage in one size fits all
Status in Postgres
Sterile transaction classes
Stonebraker, Beth
Stonebraker, Leslie
Stonebraker, Michael
collected works
failures article
ideas article
Postgres design, construction, and commercialization story
startup company guidelines. See Startup companies, guidelines
Winslett interview
Stonebraker, Michael, biography overview
academic career and birth of Ingres
academic positions
advocacy
awards and honors
career flowchart
companies founded
early years and education
industry, MIT, and new millennium
legacy
personal life
post-Ingres years
sabbatical at MIT
student genealogy chart
synopsis
Stonebraker, Michael, seminal works
便利店

C-Store

end of architectural era

Ingres implementation

OLTP databases

one size fits all

Postgres implementation

Stonebraker’s good ideas

abstract data types

Data Tamer

data warehouses

distributed databases

H-Store/VoltDB

how to exploit

Ingres

parallel databases

Postgres

startup company guidelines

Stonebraker, Sandra

Storage allocators in C-Store

Storage keys in C-Store

Storage management and structures

C-Store

commercial Ingres codeline

Ingres

Postgres

QUEL

Stored procedures

STRATEGY program in OVQP

Stream processing era

Aurora and Borealis origins

Aurora and Borealis systems

concurrent efforts

current systems

StreamBase Systems

Stream processing in one size fits all

STREAM project

Stream-SQL, enthusiasm for

STREAM Team

StreamBase codelines

April Fool’s Day joke

conclusion

customers

development

issues

StreamBase Systems

Architecture Committee

aggregation systems

from Aurora

founding

Grassy Brook renamed to

textual language

Strongly two-phase applications

Student genealogy chart

Student perspective

Subject matter experts (SMEs) in Tamr

Survey of Income and Program Participation (SIPP) data

Sybase

Synchronization in one size fits all

Sysplex Coupling Facility

System catalogs in Ingres

System

System-level data management problems and approaches

System R system

architectural features

code base

development

vs. Ingres

Systems, leadership and advocacy

Szalay, Alex

Szolovits, Peter

T-trees

Table fragments in H-Store

Tall Shark (friend)

Tamr codeline

algorithmic complexity

conclusion

Data Unification

user emphasis

variety

Tamr project and company

creation

from Data Tamer

founding

idea for

prototype

Tandem Computers

Tang, Nan

Data Civilizer

Data Civilizer article

Tango, Jo

at Highland event

venture capitalist perspective article

Tarashansky, Igor

Tatbul, Nesime

Aurora/Borealis/StreamBase codelines article

Aurora/Borealis/StreamBase reunion

Taylor, Cimarron, on Miro team

Teams for startup companies

Technical rules of engagement

Technology Licensing Offices (TLOs)

Telegraph Team

TelegraphCQ project

Telenauv company

Temporal Functional Dependencies in Data Civilizer

Tenure paper requirements

Teradata

Term sheets in startup company guidelines

Terminal monitor in Ingres

Test-of-time award

Text search in one size fits all

Thomson Reuters (TR) company

Thread support in Shore

Three dimensional problems

Throughput in OLTP

Tibbetts, Richard

Aurora/Borealis/StreamBase reunion

StreamBase development

StreamBase issues

StreamBase Systems

TIBCO Software, Inc.

Time travel feature in Postgres

TimeSeries DataBlade in Postgres

Timestamp authorities (TAs) in C-Store

TimesTen system

TMP directory in Ingres

TPC (Transaction Processing Performance Council) benchmark

Data Unification

H-Store

OLTP

TPC-B

Training workloads in C-Store

Trajman, Omer

Tran, Nga, C-Store seminal work

Transaction-less databases

Transactions

C-Store

concurrency control. See Concurrency control

features

H-Store

OLTP

rollbacks

schema characteristics

Transitive closure in Postgres

Trees schemas

Triggers

DBMS limitation

one size fits all

Postgres

rules systems

Triple Rock (friend)

Trust with venture capitalists

Tsichritzis, Dennis

Tuple movers

C-Store

Vertica Systems

Tuple storage in VoltDB

Tuples in Ingres

AMI

substitution

TIDs

variables

Turing Award in 2014

citation

overview

perspectives

Two-phase applications

Types in Postgres

Ubell, Michael

Illustra

Miro team

Postgres productionization

Undo logs

H-Store

OLTP design

UNIX process structure

Union types in Postgres

Unix platforms

Ingres

systems based on

Updates

C-Store

Ingres

Uptone (friend)

Urban Dynamics

Urban systems

User-Defined Aggregate (UDA) functions in Postgres

User-defined extensions (UDXs)

Ingres prototype

SciDB codeline

User-defined functions (UDFs) in Postgres

User-defined types (UDTs) in commercial Ingres codeline

User emphasis in Tamr

User experience (UX) design and implementation in Tamr

User feedback for Ingres

Utility commands in Ingres

VanderPlas, Jake

Varaiya, Pravin

Variety in Tamr

Venture capitalists

perspective

in startup company guidelines

Verisk Health

Vernica, Rares

Vertica Systems

from C-Store

creation

founding

HP purchase of

impact of

patent infringement suit

satisfaction with

Tamr

venture capitalist perspective

Vertica Systems codeline

architectural decisions

building

conclusion

customers

features discussion

Video data management

Vietnam war

Views

Ingres

rules systems

Vincent, Tim

VLDB demo paper for H-Store prototype

VLSI CAD design era

Voice-of-Experience (friend)

VoltDB

creation

from H-Store

H-Store executor

H-Store split

PayPal interest in

VoltDB codeline

compaction

disk persistence

latency

Volume in Tamr

Weak consistency in OLTP

Wei Hong Optimizer

Weisberg, Ariel

Whales

Whitney, Kevin

Whittaker, Andrew

Whyte, Nick

Widom, Jennifer

Winer, Mike

WinFS project

Winslett, Marianne, interview with Stonebraker

Wisconsin, 1996 fall

Wisconsin Benchmark

Wong, Eugene

Ingres

Ingres founding

Ingres implementation seminal work

RTI founding

Stonebraker guided by

Worker sites in H-Store

Workflow-based diagrammatic languages

Workload in OLTP

Write-ahead logging in Postgres

Write-optimized systems

WS column store in C-Store

Xiao, Min

OMDB

Vertica Systems

Xing, Ying, on Aurora project

XML databases in one size fits all

XPRS architecture in Postgres

XQuery language

XRM-An Extended (N-ary) Relational Memory

Yan, Robin, on Aurora project

Youssefi, Karel

Ingres team

Tandem Computers

Yu, Andrew

Postgres parser

PostgreSQL

SQLization project

Yu, Katherine

Zaniolo, Carlo

Zdonik, Stan

Aurora/Borealis/StreamBase reunion

Aurora project

Borealis project

expert sourcing

H-Store project

Shared Nothing band

stream processing era article

StreamBase Systems

Tamr project

Vertica Systems

Zero-billion-dollar ideas

Zhang, Donghui

Zilles, Stephen

Zook, Bill

Biographies

Editor

Michael L. Brodie

Michael L. Brodie has over 45 years of experience in research and industrial practice in databases, distributed systems, integration, artificial intelligence, and multidisciplinary problem-solving. Dr. Brodie is a research scientist at the Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology; advises startups; serves on advisory boards of national and international research organizations; and is an adjunct professor at the National University of Ireland, Galway and at the University of Technology, Sydney. As Chief Scientist of IT at Verizon for over 20 years, he was responsible for advanced technologies, architectures, and methodologies for IT strategies and for guiding industrial-scale deployments of emerging technologies. He has served on several National Academy of Science committees. Current interests include Big Data, Data Science, and Information Systems evolution. Dr. Brodie holds a Ph.D. in databases from the University of Toronto and a Doctor of Science (honoris causa) from the National University of Ireland. Visit www.Michaelbrodie.com for further information.

Authors

Daniel J. Abadi

Daniel J. Abadi is the Darnell-Kanal Professor of Computer Science at the University of Maryland, College Park. He performs research on database system architecture and implementation, especially at the intersection of scalable and distributed systems. He is best known for the development of the storage and query execution engines of the C-Store (column-oriented database) prototype, which was commercialized by Vertica and eventually acquired by Hewlett-Packard, and for his HadoopDB research on fault-tolerant scalable analytical database systems, which was commercialized by Hadapt and acquired by Teradata in 2014. Abadi has been a recipient of a Churchill Scholarship, a NSF CAREER Award, a Sloan Research Fellowship, a VLDB Best Paper Award, a VLDB 10-year Best Paper Award, the 2008 SIGMOD Jim Gray Doctoral Dissertation Award, the 2013–2014 Yale Provost’s Teaching Prize, and the 2013 VLDB Early Career Researcher Award. He received his Ph.D. in 2008 from MIT. He blogs at DBMS Musings (http://dbmsmusings.blogspot.com) and Tweets at @daniel_abadi.

Magdalena Balazinska

Magdalena Balazinska is a professor in the Paul G. Allen School of Computer Science and Engineering at the University of Washington and is the director of the University’s eScience Institute. She’s also director of the IGERT PhD Program in Big Data and Data Science and the associated Advanced Data Science PhD Option. Her research interests are in database management systems with a current focus on data management for data science, big data systems, and cloud computing. Magdalena holds a Ph.D. from the Massachusetts Institute of Technology (2006). She is a Microsoft Research New Faculty Fellow (2007) and received the inaugural VLDB Women in Database Research Award (2016), an ACM SIGMOD Test-of-Time Award (2017), an NSF CAREER Award (2009), a 10-year most influential paper award (2010), a Google Research Award (2011), an HP Labs Research Innovation Award (2009 and 2010), a Rogel Faculty Support Award (2006), a Microsoft Research Graduate Fellowship (2003–2005), and multiple best-paper awards.

Nikolaus Bates-Haus

Nikolaus Bates-Haus is Technical Lead at Tamr Inc., an enterprise-scale data unification company, where he assembled the original engineering team and led the development of the first generation of the product. Prior to joining Tamr, Nik was Lead Architect and Director of Engineering at Endeca (acquired by Oracle in 2011), where he led development of the MDEX analytical database engine, a schema-on-read column store designed for large-scale parallel query evaluation. Previously, Nik worked in data integration, machine learning, parallel computation, and real-time processing at Torrent Systems, Thinking Machines, and Philips Research North America. Nik holds an M.S. in Computer Science from Columbia University and a B.A. in Mathematics/Computer Science from Wesleyan University. Tamr is Nik’s seventh startup.

Philip A. Bernstein

Philip A. Bernstein is a Distinguished Scientist at Microsoft Research, where he has worked for over 20 years. He is also an Affiliate Professor of Computer Science at the University of Washington. Over the last 20 years, he has been a product architect at Microsoft and Digital Equipment Corp., a professor at Harvard University and Wang Institute of Graduate Studies, and a VP Software at Sequoia Systems. He has published over 150 papers and 2 books on the theory and implementation of database systems, especially on transaction processing and data integration, and has contributed to a variety of database products. He is an ACM Fellow, an AAAS Fellow, a winner of ACM SIGMOD’s Codd Innovations Award, a member of the Washington State Academy of Sciences, and a member of the U.S. National Academy of Engineering. He received a B.S. from Cornell and M.Sc. and Ph.D. degrees from the University of Toronto.

Janice L. Brown

Janice L. Brown is president and founder of Janice Brown & Associates, Inc., a communications consulting firm. She uses strategic communications to help entrepreneurs and visionary thinkers launch technology companies, products, and ventures, as well as sell their products and ideas. She has been involved in three ventures (so far) with 2014 Turing Award-winner Michael Stonebraker: Vertica Systems, Tamr, and the Intel Science and Technology Center for Big Data. Her background includes positions at several public relations and advertising agencies, and product PR positions at two large technology companies. Her work for the Open Software Foundation won the PRSA’s Silver Anvil Award, the “Oscar” of the PR industry. Brown has a B.A. from Simmons College. Visit www.janicebrown.com.

Paul Brown

Paul Brown first met Mike Stonebraker in early 1992 at Brewed Awakening coffee shop on Euclid Avenue in Berkeley, CA. Mike and John Forrest were interviewing Paul to take over the job Mike Olson had just left. Paul had a latte. Mike had tea. Since then, Paul has worked for two of Mike’s startups: Illustra Information Technologies and SciDB / Paradigm4. He was co-author with Mike of a book and a number of research papers. Paul has worked for a series of DBMS companies all starting with the letter “I”: Ingres, Illustra, Informix, and IBM. Alliterative ennui setting in, Paul joined Paradigm4 as SciDB’s Chief Architect. He has since moved on to work for Teradata. Paul likes dogs, DBMSs, and (void *). He hopes he might have just picked up sufficient gravitas in this industry to pull off the beard.

Paul Butterworth

Paul Butterworth served as Chief Systems Architect at Ingres from 1980–1990. He is currently co-founder and Chief Technology Officer (CTO) at VANTIQ, Inc. His past roles include Executive Vice President, Engineering at You Technology Inc., and co-founder and CTO of Emotive Communications, where he conceived and designed the Emotive Cloud Platform for enterprise mobile computing. Before that, Paul was an architect at Oracle and a founder & CTO at AmberPoint, where he directed the technical strategy for the AmberPoint SOA governance products. Prior to AmberPoint, Paul was a Distinguished Engineer and Chief Technologist for the Developer Tools Group at Sun Microsystems and a founder, Chief Architect, and Senior Vice President of Forte Software. Paul holds undergraduate and graduate degrees in Computer Science from UC Irvine.

Michael J. Carey

Michael J. Carey received his B.S. and M.S. from Carnegie-Mellon University and his Ph.D. from the University of California, Berkeley, in 1979, 1981, and 1983, respectively. He is currently a Bren Professor of Information and Computer Sciences at the University of California, Irvine (UCI) and a consulting architect at Couchbase, Inc. Before joining UCI in 2008, Mike worked at BEA Systems for seven years and led the development of BEA’s AquaLogic Data Services Platform product for virtual data integration. He also spent a dozen years teaching at the University of Wisconsin-Madison, five years at the IBM Almaden Research Center working on object-relational databases, and a year and a half at Propel Software, an e-commerce platform startup, during the infamous 2000–2001 Internet bubble. He is an ACM Fellow, an IEEE Fellow, a member of the National Academy of Engineering, and a recipient of the ACM SIGMOD E.F. Codd Innovations Award. His current interests center on data-intensive computing and scalable data management (a.k.a. Big Data).

Fred Carter

Fred Carter , a software architect in a variety of software areas, worked at Ingres Corporation in several senior positions, including Principal Scientist/Chief Architect. He is currently a principal architect at VANTIQ, Inc. Prior to VANTIQ, Fred was the runtime architect for AmberPoint, which was subsequently purchased by Oracle. At Oracle, he continued in that role, moving the AmberPoint system to a cloud-based, application performance monitoring service. Past roles included architect for EAI products at Forte (continuing at Sun Microsystems) and technical leadership positions at Oracle, where he designed distributed object services for interactive TV, online services, and content management, and chaired the Technical Committee for the Object Definition Alliance to foster standardization in the area of network-based multimedia systems. Fred has an undergraduate degree in Computer Science from Northwestern University and received his M.S. in Computer Science from UC Berkeley.

Raul Castro Fernandez

Raul Castro Fernandez is a postdoc at MIT, working with Samuel Madden and Michael Stonebraker on data discovery—how to help people find relevant data among databases, data lakes, and the cloud. Raul built Aurum, a data discovery system, to identify relevant data sets among structured data. Among other research lines, he is looking at how to incorporate unstructured data sources, such as PDFs and emails. More generally, he is interested in data-related problems, from efficient data processing to machine learning engineering. Before MIT, Raul completed his Ph.D. at Imperial College London, where he focused on designing new abstractions and building systems for large-scale data processing.

Ugur Çetintemel

Ugur Çetintemel is a professor in the department of Computer Science at Brown University. His research is on the design and engineering of high-performance, user-friendly data management and processing systems that allow users to analyze large data sets interactively. Ugur chaired SIGMOD ’09 and served on the editorial boards of VLDB Journal, Distributed and Parallel Databases, and SIGMOD Record. He is the recipient of a National Science Foundation Career Award and an IEEE 10-year test of time award in Data Engineering, among others. Ugur was a co-founder and a senior architect of StreamBase, a company that specializes in high-performance data processing. He was also a Brown Manning Assistant Professor and has been serving as the Chair of the Computer Science Department at Brown since July 2014.

Xuedong Chen

Xuedong Chen is currently an Amazon.com Web Services software developer in Andover, Massachusetts. From 2002–2007 he was a Ph.D. candidate at UMass Boston, advised by Pat and Betty O’Neil. He, along with Pat O’Neil and others, was a coauthor with Mike Stonebraker.

Mitch Cherniack

Mitch Cherniack is an Associate Professor at Brandeis University. He is a previous winner of an NSF Career Award and co-founder of Vertica Systems and StreamBase Systems. His research in Database Systems has focused on query optimization, streaming data systems, and column-based database architectures. Mitch received his Ph.D. from Brown University in 1999, an M.S. from Concordia University in 1992, and a B.Ed. from McGill University in 1984.

David J. DeWitt

David J. DeWitt joined the Computer Sciences Department at the University of Wisconsin in September 1976 after receiving his Ph.D. from the University of Michigan. He served as department chair from July 1999 to July 2004. He held the title of John P. Morgridge Professor of Computer Sciences when he retired from the University of Wisconsin in 2008. In 2008, he joined Microsoft as a Technical Fellow to establish and manage the Jim Gray Systems Lab in Madison. In 2016, he moved to Boston to join the MIT Computer Science and AI Laboratory as an Adjunct Professor. Professor DeWitt is a member of the National Academy of Engineering (1998), a fellow of the American Academy of Arts and Sciences (2007), and an ACM Fellow (1995). He received the 1995 Ted Codd SIGMOD Innovations Award. His pioneering contributions to the field of scalable database systems for “big data” were recognized by ACM with the 2009 Software Systems Award.

Aaron J. Elmore

Aaron J. Elmore is an assistant professor in the Department of Computer Science and the College of the University of Chicago. Aaron was previously a postdoctoral associate at MIT working with Mike Stonebraker and Sam Madden. Aaron’s thesis on Elasticity Primitives for Database-as-a-Service was completed at the University of California, Santa Barbara under the supervision of Divy Agrawal and Amr El Abbadi. Prior to receiving a Ph.D., Aaron spent several years in industry and completed an M.S. at the University of Chicago.

Miguel Ferreira

Miguel Ferreira is an alumnus of MIT. He was coauthor of the paper, “Integrating Compression and Execution in Column-Oriented Database Systems,” while working with Samuel Madden and Daniel Abadi, and “C-store: A Column-Oriented DBMS,” with Mike Stonebraker, Daniel Abadi, and others.

Vijay Gadepally

Vijay Gadepally is a senior member of the technical staff at the Massachusetts Institute of Technology (MIT) Lincoln Laboratory and works closely with the Computer Science and Artificial Intelligence Laboratory (CSAIL). Vijay holds an M.Sc. and Ph.D. in Electrical and Computer Engineering from The Ohio State University and a B.Tech in Electrical Engineering from the Indian Institute of Technology, Kanpur. In 2011, Vijay received an Outstanding Graduate Student Award at The Ohio State University. In 2016, Vijay received the MIT Lincoln Laboratory’s Early Career Technical Achievement Award and in 2017 was named to AFCEA’s inaugural 40 under 40 list. Vijay’s research interests are in high-performance computing, machine learning, graph algorithms, and high-performance databases.

Nabil Hachem

Nabil Hachem is currently Vice President, Head of Data Architecture, Technology, and Standards at MassMutual. He was formerly Global Head of Data Engineering at Novartis Institute for Biomedical Research, Inc. He also held senior data engineering posts at Vertica Systems, Inc., Infinity Pharmaceuticals, Upromise Inc., Fidelity Investments Corp., and Ask Jeeves Inc. Nabil began his career as an electrical engineer and operations department manager for a data telecommunications firm in Lebanon. In addition to his commercial career, Nabil taught computer science at Worcester Polytechnic Institute. He co-authored dozens of papers on scientific databases, file structures, and join algorithms, among others. Nabil received a degree in Electrical Engineering from the American University of Beirut and earned his Ph.D. in Computer Engineering from Syracuse University.

Don Haderle

Don Haderle joined IBM in 1968 as a software developer and retired in 2005 as the software executive operating as Chief Technology Officer (CTO) for Information Management. He consulted with venture capitalists and advised startups. He currently sits on technical advisory boards for a number of companies and consults independently. Considered the father of commercial high-performance, industrial-strength relational database systems, he was the technical leader and chief architect of DB2 from 1977–1998. He led DB2’s overall architecture and development, making key personal contributions to and holding fundamental patents in all key elements, including: logging primitives, memory management, transaction fail-save and recovery techniques, query processing, data integrity, sorting, and indexing. As CTO, Haderle collaborated with researchers to incubate new product directions for the information management industry. Don was appointed an IBM Fellow in 1989 and Vice President of Advanced Technology in 1991; named an ACM Fellow in 2000; and elected to the National Academy of Engineering in 2008. He is a graduate of UC Berkeley (B.A., Economics, 1967).

James Hamilton

James Hamilton is Vice President and Distinguished Engineer on the Amazon Web Services team, where he focuses on infrastructure efficiency, reliability, and scaling. He has spent more than 20 years working on high-scale services, database management systems, and compilers. Prior to joining AWS, James was architect on the Microsoft Data Center Futures team and the Windows Live Platform Services team. He was General Manager of the Microsoft Exchange Hosted Services team and has led many of the SQL Server engineering teams through numerous releases. Before joining Microsoft, James was Lead Architect on the IBM DB2 UDB team. He holds a B.Sc. in Computer Science from the University of Victoria and a Master’s in Math, Computer Science from the University of Waterloo.

Stavros Harizopoulos

Stavros Harizopoulos is currently a Software Engineer at Facebook, where he leads initiatives on Realtime Analytics. Before that, he was a Principal Engineer at AWS Redshift, a petabyte-scale columnar data warehouse in the cloud, where he led efforts on performance and scalability. In 2011, he co-founded Amiato, a fully managed real-time ETL cloud service, which was later acquired by Amazon. In the past, Stavros has held research-scientist positions at HP Labs and MIT CSAIL, working on characterizing the energy efficiency of database servers, as well as dissecting the performance characteristics of modern in-memory and column-store databases. He is a Carnegie Mellon Ph.D. and a Y Combinator alumnus.

Marti Hearst

Marti Hearst is a professor in the School of Information and the EECS Department at UC Berkeley. She was formerly a member of the research staff at Xerox PARC and received her Ph.D. from the CS Division at UC Berkeley. Her primary research interests are user interfaces for search engines, information visualization, natural language processing, and improving education. Her book Search User Interfaces was the first academic book of its kind. Prof. Hearst was named a Fellow of the ACM in 2013 and a member of the CHI Academy in 2017, and is president of the Association for Computational Linguistics. She has received four student-initiated Excellence in Teaching Awards.

Jerry Held

Jerry Held has been a successful Silicon Valley entrepreneur, executive, and investor for over 40 years. He has managed companies at all stages of growth, from conception to multi-billion-dollar global enterprise. He is currently chairman of Tamr and Madaket Health and serves on the boards of NetApp, Informatica, and Copia Global. His past board service includes roles as executive chairman of Vertica Systems and MemSQL and lead independent director of Business Objects. Previously, Dr. Held was “CEO-in-residence” at venture capital firm Kleiner Perkins Caufield & Byers. He was senior vice president of Oracle Corporation’s server product division and a member of the executive team that grew Tandem Computers from pre-revenue into a multi-billion-dollar company. Among many other roles, he led pioneering work in fault-tolerant, shared-nothing, and scale-out relational database systems. He received his Ph.D. in Computer Science from the University of California, Berkeley, where he led the initial development of the Ingres relational database management system.

Pat Helland

Pat Helland has been building databases, transaction systems, distributed systems, messaging systems, multiprocessor hardware, and scalable cloud systems since 1978. At Tandem Computers, he was Chief Architect of the transaction engine for NonStop SQL. At Microsoft, he architected Microsoft Transaction Server, Distributed Transaction Coordinator, and SQL Service Broker, and evolved the Cosmos big data infrastructure to include optimizing database features as well as petabyte-scale transactionally correct event processing. While at Amazon, Pat contributed to the design of the Dynamo eventually consistent store and also the Product Catalog. Pat attended the University of California, Irvine from 1973–1976 and was in the inaugural UC Irvine Information and Computer Science Hall of Fame. Pat chairs the Dean’s Leadership Council of the Donald Bren School of Information and Computer Sciences (ICS), UC Irvine.

Joseph M. Hellerstein

Joseph M. Hellerstein is the Jim Gray Professor of Computer Science at the University of California, Berkeley, whose work focuses on data-centric systems and the way they drive computing. He is an ACM Fellow, an Alfred P. Sloan Research Fellow, and the recipient of three ACM-SIGMOD “Test of Time” awards for his research. In 2010, Fortune Magazine included him in their list of 50 smartest people in technology, and MIT’s Technology Review magazine included his work on their TR10 list of the 10 technologies “most likely to change our world.” Hellerstein is the co-founder and Chief Strategy Officer of Trifacta, a software vendor providing intelligent interactive solutions to the messy problem of wrangling data. He serves on the technical advisory boards of a number of computing and Internet companies including Dell EMC, SurveyMonkey, Captricity, and Datometry, and previously served as the Director of Intel Research, Berkeley.

Wei Hong

Wei Hong is an engineering director in Google’s Data Infrastructure and Analysis (DIA) group, responsible for the streaming data processing area, including building and maintaining the infrastructure for some of Google’s most revenue-critical data pipelines in Ads and Commerce. Prior to joining Google, he co-founded and led three startup companies: Illustra and Cohera with Mike Stonebraker in database systems, and Arch Rock in the Internet of Things. He also held senior engineering leadership positions at Informix, PeopleSoft, Cisco, and Nest. He was a senior researcher at Intel Research Berkeley working on sensor networks and streaming database systems and won an ACM SIGMOD Test of Time Award. He is a co-inventor on 80 patents. He received his Ph.D. from UC Berkeley and his M.E., B.E., and B.S. from Tsinghua University.

John Hugg

John Hugg has a deep love for problems relating to data. He’s worked at three database product startups and on database problems within larger organizations as well. Although John dabbled in statistics in graduate school, Dr. Stonebraker lured him back to databases using the nascent VoltDB project. Working with the very special VoltDB team was an unmatched opportunity to learn and be challenged. John received an M.S. in 2007 and a B.S. in 2005 from Tufts University.

Ihab Ilyas

Ihab Ilyas is a professor in the Cheriton School of Computer Science at the University of Waterloo, where his main research focuses on the areas of big data and database systems, with special interest in data quality and integration, managing uncertain data, rank-aware query processing, and information extraction. Ihab is also a co-founder of Tamr, a startup focusing on large-scale data integration and cleaning. He is a recipient of the Ontario Early Researcher Award (2009), a Cheriton Faculty Fellowship (2013), an NSERC Discovery Accelerator Award (2014), and a Google Faculty Award (2014), and he is an ACM Distinguished Scientist. Ihab is an elected member of the VLDB Endowment board of trustees, elected SIGMOD vice chair, and an associate editor of ACM Transactions on Database Systems (TODS). He holds a Ph.D. in Computer Science from Purdue University and a B.Sc. and an M.Sc. from Alexandria University.

Jason Kinchen

Jason Kinchen, Paradigm4’s V.P. of Engineering, is a software professional with over 30 years’ experience in delivering highly complex products to life science, automotive, aerospace, and other engineering markets. He is an expert in leading technical teams through all facets of a project life cycle, from feasibility analysis to requirements to functional design to delivery and enhancement, and is experienced in developing quality-driven processes, improving the software development life cycle, and driving strategic planning. Jason is an avid cyclist and a Red Cross disaster action team volunteer.

Moshe Tov Kreps

Moshe Tov Kreps (formerly known as Peter Kreps) is a former researcher at the University of California at Berkeley and the Lawrence Berkeley National Laboratory. He was coauthor, with Mike Stonebraker, Eugene Wong, and Gerald Held, of the seminal paper, “The Design and Implementation of INGRES,” published in the ACM Transactions on Database Systems in September 1976.

Edmond Lau

Edmond Lau is the co-founder of Co Leadership, where his mission is to transform engineers into leaders. He runs leadership experiences, multi-week programs, and online courses to bridge people from where they are to the lives and careers they dream of. He’s the author of The Effective Engineer, now the de facto onboarding guide for many engineering teams. He’s spent his career leading engineering teams across Silicon Valley at Quip, Quora, Google, and Ooyala. As a leadership coach, Edmond also works directly with CTOs, directors, managers, and other emerging leaders to unlock what’s possible for them. Edmond has been featured in the New York Times, Forbes, Time, Slate, Inc., Fortune, and Wired. He blogs at coleadership.com, has a website (www.theeffectiveengineer.com), and tweets at @edmondlau.

Shilpa Lawande

Shilpa Lawande is CEO and co-founder of postscript.us, an AI startup on a mission to free doctors from clinical paperwork. Previously, she was VP/GM of the HPE Big Data Platform, including its flagship Vertica Analytics Platform. Shilpa was a founding engineer at Vertica and led its Engineering and Customer Success teams from startup through the company’s acquisition by HP. Shilpa has several patents and books on data warehousing to her name, and was named to the 2012 Mass High Tech Women to Watch list and the Rev Boston 20 in 2015. Shilpa serves as an advisor at Tamr, and as mentor/volunteer at two educational initiatives, Year Up (Boston) and CSPathshala (India). Shilpa has an M.S. in Computer Science from the University of Wisconsin-Madison and a B.S. in Computer Science and Engineering from the Indian Institute of Technology, Mumbai.

Amerson Lin

Amerson Lin received his B.S. and M.Eng., both in Computer Science, from MIT, the latter in 2005. He returned to Singapore to serve in the military and government before returning to the world of software. He was a consultant at Pivotal and then a business development lead at Palantir in both Singapore and the U.S. Amerson currently runs his own Insurtech startup—Gigacover—which delivers digital insurance to Southeast Asia.

Samuel Madden

Samuel Madden is a professor of Electrical Engineering and Computer Science in MIT’s Computer Science and Artificial Intelligence Laboratory. His research interests include databases, distributed computing, and networking. He is known for his work on sensor networks, column-oriented databases, high-performance transaction processing, and cloud databases. Madden received his Ph.D. in 2003 from the University of California at Berkeley, where he worked on the TinyDB system for data collection from sensor networks. Madden was named one of Technology Review’s Top 35 Under 35 (2005), and is the recipient of several awards, including an NSF CAREER Award (2004), a Sloan Foundation Fellowship (2007), VLDB best paper awards (2004, 2007), and a MobiCom 2006 best paper award. He also received “test of time” awards in SIGMOD 2013 and 2017 (for his work on Acquisitional Query Processing in SIGMOD 2003 and on Fault Tolerance in the Borealis system in SIGMOD 2007), and a ten-year best paper award in VLDB 2015 (for his work on the C-Store system).

Tim Mattson

Tim Mattson is a parallel programmer. He earned his Ph.D. in Chemistry from the University of California, Santa Cruz for his work in molecular scattering theory. Since 1993, Tim has been with Intel Corporation, where he has worked on High Performance Computing: both software (OpenMP, OpenCL, RCCE, and OCR) and hardware/software co-design (ASCI Red, the 80-core TFLOP chip, and the 48-core SCC). Tim’s academic collaborations include work on the fundamental design patterns of parallel programming, the BigDAWG polystore system, the TileDB array storage manager, and building blocks for graphs “in the language of linear algebra” (the GraphBLAS). Currently, he leads a team of researchers at Intel working on technologies that help application programmers write highly optimized code that runs on future parallel systems. Outside of computing, Tim fills his time with coastal sea kayaking. He is an ACA-certified kayaking coach (level 5, advanced open ocean) and instructor trainer (level 3, basic coastal).

Felix Naumann

Felix Naumann studied Mathematics, Economics, and Computer Science at the University of Technology in Berlin. He completed his Ph.D. thesis on “Quality-driven Query Answering” in 2000. In 2001 and 2002, he worked at the IBM Almaden Research Center on topics of data integration. From 2003–2006, he was assistant professor for information integration at the Humboldt-University of Berlin. Since then, he has held the chair for information systems at the Hasso Plattner Institute at the University of Potsdam in Germany. He is Editor-in-Chief of Information Systems, and his research interests are in data profiling, data cleansing, and text mining.

Mike Olson

Mike Olson co-founded Cloudera in 2008 and served as its CEO until 2013 when he took on his current role of chief strategy officer (CSO). As CSO, Mike is responsible for Cloudera’s product strategy, open-source leadership, engineering alignment, and direct engagement with customers. Prior to Cloudera, Mike was CEO of Sleepycat Software, makers of Berkeley DB, the open-source embedded database engine. Mike spent two years at Oracle Corporation as Vice President for Embedded Technologies after Oracle’s acquisition of Sleepycat in 2006. Prior to joining Sleepycat, Mike held technical and business positions at database vendors Britton Lee, Illustra Information Technologies, and Informix Software. Mike has a B.S. and an M.S. in Computer Science from the University of California, Berkeley. Mike tweets at @mikeolson.

Elizabeth O’Neil

Elizabeth O’Neil (Betty) is a Professor of Computer Science at the University of Massachusetts, Boston. Her focus is research, teaching, and software development in database engines: performance analysis, transactions, XML support, Unicode support, and buffering methods. In addition to her work for UMass Boston, she was, among other pursuits, a long-term (1977–1996) part-time Senior Scientist for Bolt, Beranek, and Newman, Inc., and during two sabbaticals was a full-time consultant for Microsoft Corporation. She is the inventor of two patents owned by Microsoft.

Patrick O’Neil

Patrick O’Neil is Professor Emeritus at the University of Massachusetts, Boston. His research has focused on database system cost-performance, transaction isolation, data warehousing, variations of bitmap indexing, and multi-dimensional databases/OLAP. In addition to his research, teaching, and service activities, he is the coauthor—with his wife Elizabeth (Betty)—of a database management textbook, and has been active in developing database performance benchmarks and corporate database consulting. He holds several patents.

Mourad Ouzzani

Mourad Ouzzani is a principal scientist with the Qatar Computing Research Institute, HBKU. Before joining QCRI, he was a research associate professor at Purdue University. His current research interests include data integration, data cleaning, and building large-scale systems to enable science and engineering. He is the lead PI of Rayyan, a system for supporting the creation of systematic reviews, which had more than 11,000 users as of March 2017. He has extensively published in top-tier venues including SIGMOD, PVLDB, ICDE, and TKDE. He received Purdue University Seed for Success Awards in 2009 and 2012. He received his Ph.D. from Virginia Tech and his M.S. and B.S. from USTHB, Algeria.

Andy Palmer

Andy Palmer is co-founder and CEO of Tamr, Inc., the enterprise-scale data unification company that he founded with fellow serial entrepreneur and 2014 Turing Award winner Michael Stonebraker, Ph.D., and others. Previously, Palmer was co-founder and founding CEO of Vertica Systems (also with Mike Stonebraker), a pioneering analytics database company (acquired by HP). He founded Koa Labs, a seed fund supporting the Boston/Cambridge entrepreneurial ecosystem, is a founder-partner at The Founder Collective, and holds a research affiliate position at MIT CSAIL. During his career as an entrepreneur, Palmer has served as founder, founding investor, board member, or advisor to more than 60 startup companies in technology, healthcare, and the life sciences. He also served as Global Head of Software and Data Engineering at Novartis Institutes for BioMedical Research (NIBR) and as a member of the start-up team and Chief Information and Administrative Officer at Infinity Pharmaceuticals (NASDAQ: INFI). Previously, he held positions at innovative technology companies Bowstreet, pcOrder.com, and Trilogy. He holds a BA from Bowdoin (1988) and an MBA from the Tuck School of Business at Dartmouth (1994).

Andy Pavlo

Andy Pavlo is an assistant professor of Databaseology in the Computer Science Department at Carnegie Mellon University. He also used to raise clams. Andy received a Ph.D. in 2013 and an M.Sc. in 2009, both from Brown University, and an M.Sc. in 2006 and a B.Sc., both from Rochester Institute of Technology.

Alex Poliakov

Alex Poliakov has over a decade of experience developing distributed database internals. At Paradigm4, he helps set the vision for the SciDB product and leads a team of Customer Solutions experts who help researchers in scientific and commercial applications make optimal use of SciDB to create new insights, products, and services for their companies. Alex previously worked at Netezza, after graduating from MIT’s Course 6. Alex is into flying drones and producing drone videos.

Alexander Rasin

Alexander Rasin is an Associate Professor in the College of Computing and Digital Media (CDM) at DePaul University. He received his Ph.D. and M.Sc. in Computer Science from Brown University, Providence, RI. He is a co-Director of Data Systems and Optimization Lab at CDM and his primary research interest is in database forensics and cybersecurity applications of forensic analysis. Dr. Rasin’s other research projects focus on building and tuning performance of domain-specific data management systems—currently in the areas of computer-aided diagnosis and software analytics. Several of his current research projects are supported by NSF.

Jennie Rogers

Jennie Rogers is the Lisa Wissner-Slivka and Benjamin Slivka Junior Professor in Computer Science and an Assistant Professor at Northwestern University. Before that she was a postdoctoral associate in the Database Group at MIT CSAIL, where she worked with Mike Stonebraker and Sam Madden. She received her Ph.D. from Brown University under the guidance of Ugur Çetintemel. Her research interests include the management of science data, federated databases, cloud computing, and database performance modeling. Her Erdős number is 3.

Lawrence A. Rowe

Lawrence A. Rowe is an Emeritus Professor of Electrical Engineering and Computer Science at U.C. Berkeley. His research interests are software systems and applications. His group developed the Berkeley Lecture Webcasting System that produced 30 course lecture webcasts each week viewed by over 500,000 people per month. His publications received three “best paper” and two “test of time” awards. He is an investor/advisor in The Batchery, a Berkeley-based seed-stage incubator. Rowe is an ACM Fellow, a co-recipient of the 2002 U.C. Technology Leadership Council Award for IT Innovation, the recipient of the 2007 U.C. Irvine Donald Bren School of ICS Distinguished Alumni Award, the 2009 recipient of the ACM SIGMM Technical Achievement Award, and a co-recipient of the Inaugural ACM SIGMOD Systems Award for the development of modern object-relational DBMS. Larry and his wife Jean produce and sell award-winning premium wines using Napa Valley grapes under the Greyscale Wines brand.

Kriti Sen Sharma

Kriti Sen Sharma is a Customer Solutions Architect at Paradigm4. He works on projects spanning multiple domains (genomics, imaging, wearables, finance, etc.). Using his skills in collaborative problem-solving, algorithm development, and programming, he builds end-to-end applications that address customers’ big-data needs and enable them to gain business insights rapidly. Kriti is an avid blogger and also loves biking and hiking. Kriti received a Ph.D. in 2013 and an M.Sc. in 2009, both from Virginia Polytechnic Institute and State University, and a B.Tech. from Indian Institute of Technology, Kharagpur, in 2005.

Nan Tang

Nan Tang is a senior scientist at Qatar Computing Research Institute, HBKU, Qatar Foundation, Qatar. He received his Ph.D. from the Chinese University of Hong Kong in 2007. He worked as a research staff member at CWI, the Netherlands, from 2008–2010. He was a research fellow at University of Edinburgh from 2010–2012. His current research interests include data curation, data visualization, and intelligent and immersive data analytics.

Jo Tango

Jo Tango founded Kepha Partners. He has invested in the e-commerce, search engine, Internet ad network, wireless, supply chain software, storage, database, security, on-line payments, and data center virtualization spaces. He has been a founding investor in many Stonebraker companies: Goby (acquired by NAVTEQ), Paradigm4, StreamBase Systems (acquired by TIBCO), Vertica Systems (acquired by Hewlett-Packard), and VoltDB. Jo previously was at Highland Capital Partners for nearly nine years, where he was a General Partner. He also spent five years with Bain & Company, where he was based in Singapore, Hong Kong, and Boston, and focused on technology and startup projects. Jo attended Yale University (B.A., summa cum laude and Phi Beta Kappa) and Harvard Business School (M.B.A., Baker Scholar). He writes a personal blog at jtangoVC.com.

Nesime Tatbul

Nesime Tatbul is a senior research scientist at the Intel Science and Technology Center at MIT CSAIL. Before joining Intel Labs, she was a faculty member at the Computer Science Department of ETH Zurich. She received her B.S. and M.S. in Computer Engineering from the Middle East Technical University (METU) and her M.S. and Ph.D. in Computer Science from Brown University. Her primary research area is database systems. She is the recipient of an IBM Faculty Award in 2008, a Best System Demonstration Award at SIGMOD 2005, and the Best Poster and the Grand Challenge awards at DEBS 2011. She has served on the organization and program committees for various conferences including SIGMOD (as an industrial program co-chair in 2014 and a group leader in 2011), VLDB, and ICDE (as a PC track chair for Streams, Sensor Networks, and Complex Event Processing in 2013).

Nga Tran

Nga Tran is currently the Director of Engineering in the server development team at Vertica, where she has worked for the last 14 years. Previously, she was a Ph.D. candidate at Brandeis University, where she participated in research that contributed to Mike Stonebraker’s work.

Marianne Winslett

Marianne Winslett has been a professor in the Department of Computer Science at the University of Illinois since 1987, and served as the Director of Illinois’s research center in Singapore, the Advanced Digital Sciences Center, from 2009–2013. Her research interests lie in information management and security, from the infrastructure level on up to the application level. She is an ACM Fellow and the recipient of a Presidential Young Investigator Award from the U.S. National Science Foundation. She is the former Vice-Chair of ACM SIGMOD and the former co-Editor-in-Chief of ACM Transactions on the Web, and has served on the editorial boards of ACM Transactions on Database Systems, IEEE Transactions on Knowledge and Data Engineering, ACM Transactions on Information and System Security, The Very Large Data Bases Journal, and ACM Transactions on the Web. She has received two best paper awards for research on managing regulatory compliance data (VLDB, SSS), one best paper award for research on analyzing browser extensions to detect security vulnerabilities (USENIX Security), and one for keyword search (ICDE). Her Ph.D. is from Stanford University.

黄尤金

Eugene Wong

图像

黄尤金是加州大学伯克利分校的荣誉退休教授。他的杰出职业生涯涵盖对学术界、商业界和公共服务的贡献。在担任 EECS 系主任期间,他带领该系度过了发展最快的时期,使其成为该领域排名最高的系之一。2004 年,随着 Eugene and Joan C. Wong 通信研究中心落成,Wireless Foundation 在 Cory Hall 成立。他撰写或合著了 100 多篇学术文章,出版了 4 本书,培养了众多学生并指导了 20 多篇博士论文。1980 年,他与 Michael Stonebraker 和 Lawrence A. Rowe 共同创立了 INGRES 公司。他曾在乔治·H·W·布什政府期间担任科学技术政策办公室副主任;1994 年至 1996 年,他担任香港科技大学负责研究与开发的副校长。1988 年,他因在 INGRES 方面的工作获得 ACM 软件系统奖,并于 2005 年被授予 IEEE 创始人奖章,获奖词十分贴切:"在国内和国际工程研究与技术政策方面的领导地位,以及在关系数据库方面的开创性贡献。"

Eugene Wong is Professor Emeritus at the University of California, Berkeley. His distinguished career includes contributions to academia, business, and public service. As Department Chair of EECS, he led the department through its greatest period of growth and into one of the highest ranked departments in its field. In 2004, the Wireless Foundation was established in Cory Hall upon completion of the Eugene and Joan C. Wong Center for Communications Research. He authored or co-authored over 100 scholarly articles and published 4 books, mentored students, and supervised over 20 dissertations. In 1980, he co-founded (with Michael Stonebraker and Lawrence A. Rowe) the INGRES Corporation. He was the Associate Director of the Office of Science and Technology Policy under George H. W. Bush; from 1994 to 1996, he was Vice President for Research and Development at Hong Kong University of Science and Technology. He received the ACM Software System Award in 1988 for his work on INGRES, and was awarded the 2005 IEEE Founders Medal, with the apt citation: “For leadership in national and international engineering research and technology policy, for pioneering contributions in relational databases.”

斯坦·兹多尼克

Stan Zdonik

图像

斯坦·兹多尼克是布朗大学计算机科学终身教授,也是数据库管理系统领域的知名研究者。他的大部分工作是将数据管理技术应用于新型数据库架构,以支持新的应用。他是 Aurora 和 Borealis 流处理引擎、C-Store 列存储 DBMS 以及 H-Store NewSQL DBMS 的共同开发者,并为 SciDB 和 BigDAWG polystore 系统等其他系统做出了贡献。他与 Michael Stonebraker 共同创立了两家初创公司:StreamBase Systems 和 Vertica Systems。更早之前,在 Bolt Beranek and Newman Inc. 工作期间,Zdonik 博士参与开发了 Prophet 系统,这是一种面向药理学家的数据管理工具。他在数据库领域发表了 150 多篇同行评审论文,并于 2006 年当选 ACM 会士。Zdonik 博士拥有计算机科学学士学位和工业管理学士学位、计算机科学硕士学位以及电气工程师学位,均来自麻省理工学院;他随后在 Michael Hammer 教授的指导下于麻省理工学院获得数据库管理方向的博士学位。

Stan Zdonik is a tenured professor of Computer Science at Brown University and a noted researcher in database management systems. Much of his work involves applying data management techniques to novel database architectures, to enable new applications. He is co-developer of the Aurora and Borealis stream processing engines, the C-Store column store DBMS, and the H-Store NewSQL DBMS, and has contributed to other systems including SciDB and the BigDAWG polystore system. He co-founded (with Michael Stonebraker) two startup companies: StreamBase Systems and Vertica Systems. Earlier, while at Bolt Beranek and Newman Inc., Dr. Zdonik worked on the Prophet System, a data management tool for pharmacologists. He has more than 150 peer-reviewed papers in the database field and was named an ACM Fellow in 2006. Dr. Zdonik has a B.S. in Computer Science and one in Industrial Management, an M.S. in Computer Science, and the degree of Electrical Engineer, all from MIT, where he went on to receive his Ph.D. in database management under Prof. Michael Hammer.